<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="review-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMES</journal-id>
<journal-id journal-id-type="nlm-ta">CMES</journal-id>
<journal-id journal-id-type="publisher-id">CMES</journal-id>
<journal-title-group>
<journal-title>Computer Modeling in Engineering &#x0026; Sciences</journal-title>
</journal-title-group>
<issn pub-type="epub">1526-1506</issn>
<issn pub-type="ppub">1526-1492</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">28130</article-id>
<article-id pub-id-type="doi">10.32604/cmes.2023.028130</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Review</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Deep Learning Applied to Computational Mechanics: A Comprehensive Review, State of the Art, and the Classics</article-title>
<alt-title alt-title-type="left-running-head">Deep Learning Applied to Computational Mechanics</alt-title>
<alt-title alt-title-type="right-running-head">Deep Learning Applied to Computational Mechanics</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Vu-Quoc</surname><given-names>Loc</given-names></name><xref ref-type="aff" rid="aff-1">1,&#x2709;</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Humer</surname><given-names>Alexander</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<aff id="aff-1"><label>1,&#x2709;</label><institution>Aerospace Engineering, University of Illinois at Urbana-Champaign</institution>, <addr-line>IL 61801</addr-line>, <country>USA</country> &#x25CF;&#x2709;<email>vql@illinois.edu</email></aff>
<aff id="aff-2"><label>2</label><institution>Institute of Technical Mechanics, Johannes Kepler University</institution>, <addr-line>A-4040 Linz</addr-line>, <country>Austria</country> &#x25CF;<email>alexander.humer@jku.at</email></aff>
</contrib-group>
<pub-date date-type="collection" publication-format="electronic">
<year>2023</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>28</day><month>6</month><year>2023</year></pub-date>
<volume>137</volume>
<issue>2</issue>
<fpage>1069</fpage>
<lpage>1343</lpage>
<history>
<date date-type="received"><day>01</day><month>12</month><year>2022</year></date>
<date date-type="accepted"><day>01</day><month>3</month><year>2023</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Vu-Quoc and Humer.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Vu-Quoc and Humer</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMES_28130.pdf"></self-uri>
<abstract>
<p>Three recent breakthroughs due to AI in arts and science serve as motivation: an award-winning digital image, protein folding, and fast matrix multiplication. Many recent developments in artificial neural networks, particularly deep learning (DL), applied and relevant to computational mechanics (solids, fluids, finite-element technology) are reviewed in detail. Both hybrid and pure machine learning (ML) methods are discussed. Hybrid methods combine traditional PDE discretizations with ML methods either (1) to help model complex nonlinear constitutive relations, (2) to nonlinearly reduce the model order for efficient simulation (turbulence), or (3) to accelerate the simulation by predicting certain components in the traditional integration methods. Here, methods (1) and (2) relied on the Long Short-Term Memory (LSTM) architecture, with method (3) relying on convolutional neural networks. Pure ML methods to solve (nonlinear) PDEs are represented by Physics-Informed Neural Network (PINN) methods, which could be combined with an attention mechanism to address discontinuous solutions. Both LSTM and attention architectures, together with modern optimizers and classic optimizers generalized to include stochasticity for DL networks, are extensively reviewed. Kernel machines, including Gaussian processes, are presented in sufficient depth for more advanced works such as shallow networks with infinite width. Addressing not only experts, this review assumes that readers are familiar with computational mechanics, but not with DL, whose concepts and applications are built up from the basics, aiming at bringing first-time learners quickly to the forefront of research. History and limitations of AI are recounted and discussed, with particular attention to pointing out misstatements or misconceptions of the classics, even in well-known references. Positioning and pointing control of a large-deformable beam is given as an example.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd><italic>Deep learning</italic></kwd>
<kwd>breakthroughs</kwd>
<kwd>network architectures</kwd>
<kwd>backpropagation</kwd>
<kwd>stochastic optimization methods from classic to modern</kwd>
<kwd>recurrent neural networks</kwd>
<kwd>long short-term memory</kwd>
<kwd>gated recurrent unit</kwd>
<kwd>attention</kwd>
<kwd>transformer</kwd>
<kwd>kernel machines</kwd>
<kwd>Gaussian processes</kwd>
<kwd>libraries</kwd>
<kwd>Physics-Informed Neural Networks</kwd>
<kwd>state-of-the-art</kwd>
<kwd>history</kwd>
<kwd>limitations</kwd>
<kwd>challenges</kwd>
<kwd><italic>applications to computational mechanics</italic></kwd>
<kwd>Finite-element matrix integration</kwd>
<kwd>improved Gauss quadrature</kwd>
<kwd>Multiscale geomechanics</kwd>
<kwd>fluid-filled porous media</kwd>
<kwd>Fluid mechanics</kwd>
<kwd>turbulence</kwd>
<kwd>proper orthogonal decomposition</kwd>
<kwd><italic>Nonlinear-manifold model-order reduction</italic></kwd>
<kwd>autoencoder</kwd>
<kwd>hyper-reduction using gappy data</kwd>
<kwd>control of large deformable beam</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<p><disp-quote>
<p><italic>A classic never dies</italic>.</p>
<p><italic>&#x201C;We must welcome the future, for it soon will be the past, and we must respect the past, for it was once all that was humanly possible.&#x201D;</italic></p>
<p>George Santayana</p></disp-quote></p>
<boxed-text>
<caption><title>SUMMARY</title></caption>
<p>Three representative applications of deep learning in computational mechanics&#x2013;involving numerical integration for finite elements, a complex constitutive model in solid mechanics, and proper orthogonal decomposition in fluid mechanics&#x2013;are reviewed in detail, and used as motivation for a further in-depth review of some key technologies of deep learning, building up from the basics to the state of the art, focusing, to the extent possible, on the most recent papers that had an important impact in the field.</p>
<p>Both static and dynamic time-dependent problems are discussed. Discrete time-dependent problems, as a sequence of data, can be modeled with recurrent neural networks, using the 1997 classic architecture Long Short-Term Memory (LSTM), but also the recent 2017-18 architectures such as the transformer, based on the concept of attention, all of which are discussed in detail. Continuous recurrent neural networks, originally developed in neuroscience to model the brain, and their connection to their discrete counterparts in deep learning are also discussed in detail.</p>
<p>For training networks&#x2013;i.e., finding optimal parameters that yield low training error and lowest validation error&#x2013;both classic deterministic optimization methods (using full batch) and stochastic optimization methods (using minibatches) are reviewed in detail, and at times even derived. Deterministic gradient descent with classical line-search methods, such as Armijo&#x2019;s rule, was generalized to add stochasticity. Detailed pseudocodes for these methods are provided. The classic stochastic gradient descent (SGD), with add-on tricks such as momentum, step-length decay, cyclic annealing, and weight decay, is presented, often with detailed derivations.</p>
<p>Step-length decay is shown to be equivalent to simulated annealing, using a stochastic differential equation equivalent to the discrete parameter update. A consequence is that the minibatch size can be increased instead of decaying the step length. In particular, we obtain a new result for minibatch-size increase.</p>
<p>Highly popular adaptive step-length (learning-rate) methods are discussed in a unified manner, which covers AdaGrad, RMSProp, the &#x201C;immensely successful&#x201D; Adam and its variants, through to the recent AdamW.</p>
<p>Exponential smoothing of time series, the key technique of adaptive methods, which originated in the field of forecasting in the 1950s and is overlooked in (or unknown to) other review papers and even well-known books on deep learning, is carefully explained.</p>
<p>Particular attention is given to a recent criticism of adaptive methods, revealing their marginal value for generalization, compared to good old SGD with effective initial step-length tuning and decay. The results were confirmed in three recent independent papers.</p>
<p>Kernel machines, including Gaussian processes, a most important class of non-parametric modeling with accurate uncertainty estimates, are introduced in sufficient detail to prepare for more advanced works on networks with infinite width that constitute the 2021 breakthrough in computer science.</p>
<p>Applications of deep learning in computational mechanics often aim at reducing computational cost, which naturally connects to the field of (nonlinear) model-order reduction (MOR). We review how LSTM networks were trained to predict the rate-dependent constitutive response in a multi-scale problem of porous media and how they were used as time integrators for reduced-order models (ROMs) inferred from highly resolved direct numerical simulations of turbulent flows. Autoencoders based on shallow networks provide an effective means for nonlinear-manifold MOR and for the hyper-reduction methods built on top of it.</p>
<p>A rare feature of the present paper is a detailed review of some important classics to connect to the relevant concepts in modern literature, sometimes revealing misunderstandings in recent works, which were likely due to a lack of verification of the assertions made against the corresponding classics. For example, the first artificial neural network, conceived by Rosenblatt (1957) [<xref ref-type="bibr" rid="ref-1">1</xref>], (1962) [<xref ref-type="bibr" rid="ref-2">2</xref>] had 1000 neurons, but was reported as having a single neuron. Going beyond probabilistic analysis, Rosenblatt even built the Mark I computer to implement his 1000-neuron network. Another example is the &#x201C;heavy ball&#x201D; method, for which everyone referred to [<xref ref-type="bibr" rid="ref-3">3</xref>], whose author more precisely called it the &#x201C;small heavy sphere&#x201D; method. Others were quick to dismiss classical deterministic line-search methods that have been generalized to add stochasticity for network training. Unintended misrepresentation of the classics would mislead first-time learners, and unfortunately even seasoned researchers who used second-hand information from others, without checking the original classics themselves.</p>
<p>The experiments in the 1950s that discovered the rectified linear behavior in the neuronal axon, modeled as a circuit with a diode, together with the use of the rectified linear activation function in neural networks in neuroscience years before being adopted for use in deep-learning networks, are reviewed.</p>
<p>The use of Volterra series to model the nonlinear behavior of a neuron in terms of input and output firing rates, leading to continuous recurrent neural networks, is examined in detail. The linear term of the Volterra series is a convolution integral that provides a theoretical foundation for the use of a linear combination of inputs to a neuron, with weights and biases.</p>
<p>A goal of this in-depth review is not only to provide the state of the art for computational-mechanics readers with some familiarity with deep-learning networks, but also to serve first-time learners, by developing relevant fundamental concepts from the basics. Moreover, for the convenience of the readers, detailed references are provided, e.g., page numbers in thick books, links to online references and open reviews where available.</p></boxed-text>
<sec id="s1"><label>1</label><title>Opening remarks and organization</title>
<p><italic>Breakthroughs due to AI in arts and science</italic>. On 2022.08.29, the image in Figure <xref ref-type="fig" rid="fig-1">1</xref>, generated by the AI software <ext-link ext-link-type="uri" xlink:href="https://www.midjourney.com/home/">Midjourney</ext-link>, became one of the first of its kind to win first place in an art contest.<xref ref-type="fn" rid="fn1"><sup>1</sup></xref><fn id="fn1"><label>1</label><p>See also the Midjourney <ext-link ext-link-type="uri" xlink:href="https://www.midjourney.com/showcase/">Showcase</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://www.midjourney.com/showcase/">Internet archived on 2022.09.07</ext-link>, the video <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=_WfDlJY1nog">Guide to MidJourney AI Art - How to get started FREE!</ext-link> and several other Midjourney tutorial videos on YouTube.</p></fn> The image author signed his entry to the contest as &#x201C;Jason M. Allen via Midjourney,&#x201D; indicating that the submitted digital art was not created by him in the traditional way, but under his text commands to AI software. Artists not using AI software (such as Midjourney, <ext-link ext-link-type="uri" xlink:href="https://openai.com/dall-e-2/">DALL&#x00B7;E 2</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://stability.ai/blog/stable-diffusion-public-release">Stable Diffusion</ext-link>) were not happy [<xref ref-type="bibr" rid="ref-4">4</xref>].</p>
<p>In 2021, an AI software achieved a feat that human researchers had not been able to accomplish in the last 50 years: predicting protein structures quickly and on a large scale. This feat was named the scientific breakthrough of the year; Figure <xref ref-type="fig" rid="fig-2">2</xref>, left. In 2016, another AI software beat the world grandmaster in the game of Go, which is described as the most complex game that humans ever created; Figure <xref ref-type="fig" rid="fig-2">2</xref>, right.</p>
<p>On 2022.10.05, DeepMind published a paper on breaking a 50-year record of fast matrix multiplication by reducing the number of multiplications in multiplying two <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mn>4</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> matrices from 49 to 47 (with the traditional method requiring 64 multiplications), owing to an algorithm discovered with the help of their AI software AlphaTensor [<xref ref-type="bibr" rid="ref-8">8</xref>].<xref ref-type="fn" rid="fn2"><sup>2</sup></xref><fn id="fn2"><label>2</label><p>Their goal was of course to discover fast multiplication algorithms for matrices of arbitrarily large size. See also &#x201C;Discovering novel algorithms with AlphaTensor,&#x201D; DeepMind, 2022.10.05, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20221014160200/https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor">Internet archive</ext-link>.</p></fn> Barely a week later, two mathematicians announced an algorithm that required only 46 multiplications.</p>
<p>Since the preprint of this paper was posted on the arXiv in Dec 2022 [<xref ref-type="bibr" rid="ref-9">9</xref>], there have been considerable excitement and concern about ChatGPT, a large language-model chatbot that can interact with humans in a conversational way, which would be incorporated into Microsoft Bing to make web &#x201C;search interesting again, after years of stagnation and stasis&#x201D; [<xref ref-type="bibr" rid="ref-10">10</xref>], whose author wrote &#x201C;I&#x2019;m going to do something I thought I&#x2019;d never do: I&#x2019;m switching my desktop computer&#x2019;s default search engine to Bing. And Google, my default source of information for my entire adult life, is going to have to fight to get me back.&#x201D; Google would release its own answer to ChatGPT called &#x201C;Bard&#x201D; [<xref ref-type="bibr" rid="ref-11">11</xref>]. The race is on.</p>
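<p>The counts 49 and 64 quoted above can be checked with a short sketch. It assumes that the 49-multiplication baseline is Strassen&#x2019;s classic scheme (7 multiplications for a 2&#x00D7;2 product) applied recursively to a 4&#x00D7;4 matrix viewed as 2&#x00D7;2 blocks, giving 7&#x00D7;7 = 49, against 4&#x00D7;4&#x00D7;4 = 64 for the traditional triple loop. The Python code is illustrative only: it is not the AlphaTensor algorithm, and the function names <monospace>strassen</monospace> and <monospace>naive</monospace> are hypothetical helpers introduced here.</p>
<code language="python"><![CDATA[
```python
# Illustrative sketch (not AlphaTensor): count scalar multiplications.

def add(A, B):
    """Entrywise sum of two equally sized square blocks."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    """Entrywise difference of two equally sized square blocks."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    """Split a 2n x 2n matrix into four n x n blocks."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])

def strassen(A, B, count):
    """Recursive Strassen product; count[0] tallies scalar multiplications."""
    if len(A) == 1:
        count[0] += 1
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # The seven block products of Strassen's scheme.
    M1 = strassen(add(A11, A22), add(B11, B22), count)
    M2 = strassen(add(A21, A22), B11, count)
    M3 = strassen(A11, sub(B12, B22), count)
    M4 = strassen(A22, sub(B21, B11), count)
    M5 = strassen(add(A11, A12), B22, count)
    M6 = strassen(sub(A21, A11), add(B11, B12), count)
    M7 = strassen(sub(A12, A22), add(B21, B22), count)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom

def naive(A, B, count):
    """Traditional triple-loop product; count[0] tallies multiplications."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                count[0] += 1
    return C

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 4, 1, 1], [3, 2, 2, 0]]
count_s, count_n = [0], [0]
same = strassen(A, B, count_s) == naive(A, B, count_n)
print(same, count_s[0], count_n[0])  # prints: True 49 64
```
]]></code>
<p>The sketch only reproduces the two baseline counts; the 47-multiplication scheme itself is given in [<xref ref-type="bibr" rid="ref-8">8</xref>] and is not reconstructed here.</p>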
<fig id="fig-1">
<label>Figure 1</label>
<caption><title><italic>AI-generated image won contest</italic> in the category of Digital Arts, Emerging Artists, on 2022.08.29 (Section <xref ref-type="sec" rid="s1">1</xref>). &#x201C;Th&#x00E9;&#x00E2;tre D&#x2019;op&#x00E9;ra Spatial&#x201D; (Space Opera Theater) by &#x201C;Jason M. Allen via Midjourney&#x201D;, which is &#x201C;an artificial intelligence program that turns lines of text into hyper-realistic graphics&#x201D; [<xref ref-type="bibr" rid="ref-4">4</xref>]. Colorado State Fair, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220904163544/https://coloradostatefair.com/wp-content/uploads/2022/08/2022-Fine-Arts-First-Second-Third.pdf">2022 Fine Arts First, Second &amp; Third</ext-link>. (Permission of Jason M. Allen, CEO, <ext-link ext-link-type="uri" xlink:href="http://www.incarnategames.com/blog/">Incarnate Games</ext-link>)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-1.tif"/>
</fig>
<p><italic>Audience</italic>. This review paper is written by mechanics practitioners for mechanics practitioners, who may or may not be familiar with neural networks and deep learning. We thus assume that the readers are familiar with continuum mechanics and numerical methods such as the finite element method. Thus, unlike typical computer-science papers on deep learning, notation and convention of tensor analysis familiar to practitioners of mechanics are used here whenever possible.<xref ref-type="fn" rid="fn3"><sup>3</sup></xref><fn id="fn3"><label>3</label><p>Tensors are not matrices; other concepts are summation convention on repeated indices, chain rule, and matrix index convention for natural conversion from component form to matrix (and then tensor) form. See Section <xref ref-type="sec" rid="s4_2">4.2</xref> on Matrix notation.</p></fn></p>
<p>For readers not familiar with deep learning, unlike many other review papers, this review paper is not just a summary of papers in the literature for people who already have some familiarity with this topic,<xref ref-type="fn" rid="fn4"><sup>4</sup></xref><fn id="fn4"><label>4</label><p>See the review papers on deep learning, e.g., [<xref ref-type="bibr" rid="ref-12">12</xref>] [<xref ref-type="bibr" rid="ref-13">13</xref>] [<xref ref-type="bibr" rid="ref-14">14</xref>] [<xref ref-type="bibr" rid="ref-15">15</xref>] [<xref ref-type="bibr" rid="ref-16">16</xref>] [<xref ref-type="bibr" rid="ref-17">17</xref>] [<xref ref-type="bibr" rid="ref-18">18</xref>], many of which did not provide extensive discussion on applications, particularly on computational mechanics, such as in the present review paper.</p></fn> particularly papers on deep-learning neural networks, but contains also a tutorial on this topic aiming at bringing first-time learners (including students) quickly up-to-date with modern issues and applications of deep learning, especially to computational mechanics.<xref ref-type="fn" rid="fn5"><sup>5</sup></xref><fn id="fn5"><label>5</label><p>An example of a confusing point for <italic>first-time learners</italic> with knowledge of electrical circuits, hydraulics, or (biological) computational neuroscience [<xref ref-type="bibr" rid="ref-19">19</xref>] would be the interpretation of the arrows in an artificial neural network such as those in Figure <xref ref-type="fig" rid="fig-7">7</xref> and Figure <xref ref-type="fig" rid="fig-8">8</xref>: Would these arrows represent real physical flows (electron flow, fluid flow, etc.)? No, they represent function mapping (or information passing); see Section <xref ref-type="sec" rid="s4_3_1">4.3.1</xref> on Graphical representation. 
Even a tutorial such as [<xref ref-type="bibr" rid="ref-20">20</xref>] would follow the same format as many other papers, and while alluding to the human brain in their Figure <xref ref-type="fig" rid="fig-2">2</xref> (which is the equivalent of Figure <xref ref-type="fig" rid="fig-8">8</xref> below), did not explain the meaning of the arrows.</p></fn> As a result, this review paper is a convenient &#x201C;one-stop shopping&#x201D; that provides the necessary fundamental information, with clarification of potentially confusing points, for first-time learners to quickly acquire a general understanding of the field that would facilitate deeper study and application to computational mechanics.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption><title><italic>Breakthroughs in AI</italic> (Section <xref ref-type="sec" rid="s2">2</xref>). <italic>Left:</italic> The journal <italic>Science</italic> 2021 Breakthrough of the Year. Protein folded 3-D shape produced by the AI software AlphaFold compared to experiment with high accuracy [<xref ref-type="bibr" rid="ref-5">5</xref>]. The <ext-link ext-link-type="uri" xlink:href="https://alphafold.ebi.ac.uk/">AlphaFold Protein Structure Database</ext-link> contains more than 200 million protein structure predictions, a holy grail sought after in the last 50 years. <italic>Right:</italic> The AI software AlphaGo, a runner-up in the journal <italic>Science</italic> 2016 Breakthrough of the Year, beat the European Go champion Fan Hui five games to zero in 2015 [<xref ref-type="bibr" rid="ref-6">6</xref>], and then went on to defeat the world Go grandmaster Lee Sedol in 2016 [<xref ref-type="bibr" rid="ref-7">7</xref>]. (Permission by <ext-link ext-link-type="uri" xlink:href="https://www.nature.com/nature-portfolio/reprints-and-permissions">Nature</ext-link>.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-2.tif"/>
</fig>
<p><italic>Deep-learning software libraries</italic>. Just as there is a large number of software packages available in different subfields of computational mechanics, there are many excellent deep-learning libraries ready for use in applications; see Section <xref ref-type="sec" rid="s9">9</xref>, in which some examples of the use of these libraries in engineering applications are provided with the associated computer code. Similar to learning finite-element formulations versus learning how to run finite-element codes, our focus here is to discuss various algorithmic aspects of deep learning and their applications in computational mechanics, rather than how to use deep-learning libraries in applications. We agree with the view that &#x201C;a solid understanding of the core principles of neural networks and deep learning&#x201D; would provide &#x201C;insights that will still be relevant years from now&#x201D; [<xref ref-type="bibr" rid="ref-21">21</xref>], and that would not be obtained from just learning to run some hot libraries.</p>
<p>Readers already familiar with neural networks may find the presentation refreshing,<xref ref-type="fn" rid="fn6"><sup>6</sup></xref><fn id="fn6"><label>6</label><p>Particularly the top-down approach for both feedforward network (Section <xref ref-type="sec" rid="s4">4</xref>) and back propagation (Section <xref ref-type="sec" rid="s5">5</xref>).</p></fn> and even find new information on neural networks, depending how they used deep learning, or when they stopped working in this area due to the waning wave of connectionism and the new wave of deep learning.<xref ref-type="fn" rid="fn7"><sup>7</sup></xref><fn id="fn7"><label>7</label><p>It took five years from the publication of Rumelhart <italic>et al</italic>. 1986 [<xref ref-type="bibr" rid="ref-22">22</xref>] to the paper by Ghaboussi <italic>et al</italic>. 1991 [<xref ref-type="bibr" rid="ref-23">23</xref>], in which backpropagation (Section <xref ref-type="sec" rid="s5">5</xref>) was applied. It took more than twenty years from the publication of Long Short-Term Memory (LSTM) units in [<xref ref-type="bibr" rid="ref-24">24</xref>] to the two recent papers [<xref ref-type="bibr" rid="ref-25">25</xref>] and [<xref ref-type="bibr" rid="ref-26">26</xref>], which are reviewed in detail here, and where recurrent neural networks (RNNs, Section <xref ref-type="sec" rid="s7">7</xref>) with LSTM units (Section <xref ref-type="sec" rid="s7_2">7.2</xref>) were applied, even though there were some early works on application of RNNs (without LSTM units) in civil / mechanical engineering such as [<xref ref-type="bibr" rid="ref-27">27</xref>] [<xref ref-type="bibr" rid="ref-28">28</xref>] [<xref ref-type="bibr" rid="ref-29">29</xref>] [<xref ref-type="bibr" rid="ref-30">30</xref>]. But already, &#x201C;fully attentional Transformer&#x201D; was proposed to render &#x201C;intricately constructed LSTM&#x201D; unnecessary [<xref ref-type="bibr" rid="ref-31">31</xref>]. 
Most modern networks use the default rectified linear function (ReLU)&#x2013;which had been in use in computational neuroscience since before [<xref ref-type="bibr" rid="ref-32">32</xref>] and [<xref ref-type="bibr" rid="ref-19">19</xref>], and was then adopted in computer science beginning with [<xref ref-type="bibr" rid="ref-33">33</xref>] and [<xref ref-type="bibr" rid="ref-34">34</xref>]&#x2013;instead of the traditional sigmoid function dating back to the mid-1970s with [<xref ref-type="bibr" rid="ref-35">35</xref>], yet many newer activation functions continue to appear regularly, aiming at improving accuracy and efficiency over previous activation functions, e.g., [<xref ref-type="bibr" rid="ref-36">36</xref>], [<xref ref-type="bibr" rid="ref-37">37</xref>]. In computational mechanics, by the beginning of 2019, there had not yet been widespread use of the ReLU activation function, even though ReLU was mentioned in [<xref ref-type="bibr" rid="ref-38">38</xref>], where the sigmoid function was actually employed to obtain the results (Section <xref ref-type="sec" rid="s10">10</xref>). See also Section <xref ref-type="sec" rid="s13">13</xref> on Historical perspective.</p></fn> If not, readers can skip these sections to go directly to the sections on applications of deep learning to computational mechanics.</p>
<p><italic>Applications of deep learning in computational mechanics</italic>. We select some recent papers on application of deep learning to computational mechanics to review in detail, in a way that readers would also understand the computational mechanics contents well enough without having to read through the original papers:</p>
<list list-type="bullet">
<list-item><p>Fully-connected feedforward neural networks were employed to make element-matrix integration more efficient, while retaining the accuracy of the traditional Gauss-Legendre quadrature [<xref ref-type="bibr" rid="ref-38">38</xref>];<xref ref-type="fn" rid="fn8"><sup>8</sup></xref><fn id="fn8"><label>8</label><p>It would be interesting to investigate how the adjusted integration weights using the method in [<xref ref-type="bibr" rid="ref-38">38</xref>] would affect the stability of an element stiffness matrix with reduced integration (even in the absence of locking) and the superconvergence of the strains / stresses at the Barlow sampling points. See, e.g., [<xref ref-type="bibr" rid="ref-39">39</xref>], p. 499. The optimal locations of these strain / stress sampling points do not depend on the integration weights, but only on the degree of the interpolation polynomials; see [<xref ref-type="bibr" rid="ref-40">40</xref>] [<xref ref-type="bibr" rid="ref-41">41</xref>]. &#x201C;The Gauss points corresponding to reduced integration are the Barlow points (Barlow, 1976) at which the strains are most accurately predicted if the elements are well-shaped&#x201D; [<xref ref-type="bibr" rid="ref-42">42</xref>].</p></fn></p> </list-item>
<list-item><p>Recurrent neural network (RNN) with Long Short-Term Memory (LSTM) units<xref ref-type="fn" rid="fn9"><sup>9</sup></xref><fn id="fn9"><label>9</label><p>It is only a coincidence that (1) Hochreiter (1997), the first author in [<xref ref-type="bibr" rid="ref-24">24</xref>], which was the original paper on the widely used and highly successful Long Short-Term Memory (LSTM) unit, is on the faculty at Johannes Kepler University (home institution of this paper&#x2019;s author A.H.), and that (2) Ghaboussi (1991), the first author in [<xref ref-type="bibr" rid="ref-23">23</xref>], who was among the first researchers to apply fully-connected feedforward neural network to constitutive behavior in solid mechanics, was on the faculty at the University of Illinois at Urbana-Champaign (home institution of author L.V.Q.). See also [<xref ref-type="bibr" rid="ref-43">43</xref>], and for early applications of neural networks in other areas of mechanics, see e.g., [<xref ref-type="bibr" rid="ref-44">44</xref>], [<xref ref-type="bibr" rid="ref-45">45</xref>], [<xref ref-type="bibr" rid="ref-46">46</xref>].</p></fn> was applied to multiple-scale, multi-physics problems in solid mechanics [<xref ref-type="bibr" rid="ref-25">25</xref>];</p> </list-item>
<list-item><p>RNNs with LSTM units were employed to obtain a reduced-order model for turbulence in fluids based on the proper orthogonal decomposition (POD), a classic linear projection method also known as principal components analysis (PCA) [<xref ref-type="bibr" rid="ref-26">26</xref>]. More recent nonlinear-manifold model-order reduction methods, incorporating encoder / decoder and hyper-reduction of dimensionality using gappy (incomplete) data, were introduced, e.g., [<xref ref-type="bibr" rid="ref-47">47</xref>] [<xref ref-type="bibr" rid="ref-48">48</xref>].</p></list-item></list>
<p><italic>Organization of contents</italic>. Our review of each of the above papers is divided into two parts. The first part is to summarize the main results and to identify the concepts of deep learning used in the paper, expected to be new for first-time learners, for subsequent elaboration. The second part is to explain in detail how these deep-learning concepts were used to produce the results.</p>
<p>The results of deep-learning numerical integration [<xref ref-type="bibr" rid="ref-38">38</xref>] are presented in Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>, where the deep-learning concepts employed are identified and listed, whereas the details of the formulation in [<xref ref-type="bibr" rid="ref-38">38</xref>] are discussed in Section <xref ref-type="sec" rid="s10">10</xref>. Similarly, the results and additional deep-learning concepts used in a multi-scale, multi-physics problem of geomechanics [<xref ref-type="bibr" rid="ref-25">25</xref>] are presented in Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, whereas the details of this formulation are discussed in Section <xref ref-type="sec" rid="s11">11</xref>. Finally, the results and additional deep-learning concepts used in turbulent fluid simulation with proper orthogonal decomposition [<xref ref-type="bibr" rid="ref-26">26</xref>] are presented in Section <xref ref-type="sec" rid="s2_3_3">2.3.3</xref>, whereas the details of this formulation, together with the nonlinear-manifold model-order reduction [<xref ref-type="bibr" rid="ref-47">47</xref>] [<xref ref-type="bibr" rid="ref-48">48</xref>], are discussed in Section <xref ref-type="sec" rid="s12">12</xref>.</p>
<p>All of the deep-learning concepts identified from the above papers, selected for in-depth review, are subsequently explained in detail in Sections <xref ref-type="sec" rid="s3">3</xref> to <xref ref-type="sec" rid="s7">7</xref>, and then further in Section <xref ref-type="sec" rid="s13">13</xref> on &#x201C;Historical perspective&#x201D;.</p>
<p>The parallelism between computational mechanics, neuroscience, and deep learning is summarized in Section <xref ref-type="sec" rid="s3">3</xref>, which would put computational-mechanics first-time learners at ease, before delving into the details of deep-learning concepts.</p>
<p>Both time-independent (static) and time-dependent (dynamic) problems are discussed. The architecture of (static, time-independent) feedforward multilayer neural networks in Section <xref ref-type="sec" rid="s4">4</xref> is expounded in detail, with first-time learners in mind, without assuming prior knowledge, and where experts may find a refreshing presentation and even new information.</p>
<p>Backpropagation, explained in Section <xref ref-type="sec" rid="s5">5</xref>, is an important method to compute the gradient of the cost function relative to the network parameters for use as a descent direction to decrease the cost function for network training.</p>
<p>For training networks&#x2014;i.e., finding optimal parameters that yield low training error and lowest validation error&#x2014;both classic deterministic optimization methods (using full batch) and stochastic optimization methods (using minibatches) are reviewed in detail, and at times even derived, in Section <xref ref-type="sec" rid="s6">6</xref>, which would be useful for both first-time learners and experts alike.</p>
<p>The examples used in training a network form the training set, which is complemented by the validation set (to determine when to stop the optimization iterations) and the test set (to see whether the resulting network could work on examples never seen before); see Section <xref ref-type="sec" rid="s6_1">6.1</xref>.</p>
<p><xref ref-type="sec" rid="s6_2">Deterministic gradient descent</xref> with classical line search methods, such as <xref ref-type="sec" rid="s6_2_3">Armijo&#x2019;s rule</xref> (Section <xref ref-type="sec" rid="s6_2">6.2</xref>), was generalized to add stochasticity. Detailed pseudocodes for these methods are provided. The classic stochastic gradient descent (<xref ref-type="sec" rid="s6_3">SGD</xref>) by Robbins &amp; Monro (1951) [<xref ref-type="bibr" rid="ref-49">49</xref>] (Section <xref ref-type="sec" rid="s6_3">6.3</xref>, Section <xref ref-type="sec" rid="s6_3_1">6.3.1</xref>), with add-on tricks such as <xref ref-type="sec" rid="s6_3_2">momentum</xref> by Polyak (1964) [<xref ref-type="bibr" rid="ref-3">3</xref>] and <xref ref-type="sec" rid="s6_3_2">fast (accelerated) gradient</xref> by Nesterov (1983 [<xref ref-type="bibr" rid="ref-50">50</xref>], 2018 [<xref ref-type="bibr" rid="ref-51">51</xref>]) (Section <xref ref-type="sec" rid="s6_3_2">6.3.2</xref>), <xref ref-type="sec" rid="s6_3_4">step-length decay</xref> (Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref>), <xref ref-type="sec" rid="s6_3_4">cyclic annealing</xref> (Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref>), <xref ref-type="sec" rid="s6_3_5">minibatch-size increase</xref> (Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>), and <xref ref-type="sec" rid="s6_3_6">weight decay</xref> (Section <xref ref-type="sec" rid="s6_3_6">6.3.6</xref>), is presented, often with detailed derivations.</p>
<p><xref ref-type="sec" rid="s6_3_4">Step-length decay</xref> is shown to be equivalent to simulated annealing, using a stochastic differential equation equivalent to the discrete parameter update. A consequence is that the minibatch size can be increased instead of decaying the step length (Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>). In particular, we obtain a new result for minibatch-size increase.</p>
<p>Highly popular <xref ref-type="sec" rid="s6_5">adaptive step-length</xref> (learning-rate) methods are the subject of Section <xref ref-type="sec" rid="s6_5">6.5</xref>. They are first discussed in a unified manner in Section <xref ref-type="sec" rid="s6_5_1">6.5.1</xref>, followed by the first paper on <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref> [<xref ref-type="bibr" rid="ref-52">52</xref>] (Section <xref ref-type="sec" rid="s6_5_2">6.5.2</xref>).</p>
<p>Overlooked in (or unknown to) other review papers and even well-known books on deep learning, exponential smoothing of time series, which originated in the field of forecasting in the 1950s and is the key technique of adaptive methods, is carefully explained in Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref>.</p>
<p>The first adaptive methods that employed exponential smoothing were <xref ref-type="sec" rid="s6_5_4">RMSProp</xref> [<xref ref-type="bibr" rid="ref-53">53</xref>] (Section <xref ref-type="sec" rid="s6_5_4">6.5.4</xref>) and <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref> [<xref ref-type="bibr" rid="ref-54">54</xref>] (Section <xref ref-type="sec" rid="s6_5_5">6.5.5</xref>), both introduced at about the same time, followed by the &#x201C;immensely successful&#x201D; <xref ref-type="sec" rid="s6_5_6">Adam</xref> (Section <xref ref-type="sec" rid="s6_5_6">6.5.6</xref>) and its variants (Sections <xref ref-type="sec" rid="s6_5_7">6.5.7</xref> and <xref ref-type="sec" rid="s6_5_8">6.5.8</xref>).</p>
<p>Particular attention is then given to a recent criticism of adaptive methods in [<xref ref-type="bibr" rid="ref-55">55</xref>], revealing their marginal value for generalization, compared to the good old SGD with effective initial step-length tuning and step-length decay (Section <xref ref-type="sec" rid="s6_5_9">6.5.9</xref>). The results were confirmed in three recent independent papers, among which is the recent <xref ref-type="sec" rid="s6_5_10">AdamW</xref> adaptive method in [<xref ref-type="bibr" rid="ref-56">56</xref>] (Section <xref ref-type="sec" rid="s6_5_10">6.5.10</xref>).</p>
<p>Dynamics, sequential data, and sequence modeling are the subjects of Section <xref ref-type="sec" rid="s7">7</xref>. Discrete time-dependent problems, as a sequence of data, can be modeled with recurrent neural networks discussed in Section <xref ref-type="sec" rid="s7_1">7.1</xref>, using classic architectures such as the 1997 Long Short-Term Memory (LSTM) in Section <xref ref-type="sec" rid="s7_2">7.2</xref>, but also recent 2017-18 architectures such as the transformer introduced in [<xref ref-type="bibr" rid="ref-31">31</xref>] (Section <xref ref-type="sec" rid="s7_4_3">7.4.3</xref>), based on the concept of attention [<xref ref-type="bibr" rid="ref-57">57</xref>]. Continuous recurrent neural networks, originally developed in neuroscience to model the brain, and their connection to their discrete counterparts in deep learning are also discussed in detail; see [<xref ref-type="bibr" rid="ref-19">19</xref>] and Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref> on &#x201C;Dynamic, time dependence, Volterra series&#x201D;.</p>
<p>The features of several popular, open-source deep-learning frameworks and libraries&#x2014;such as TensorFlow, Keras, PyTorch, etc.&#x2014;are summarized in Section <xref ref-type="sec" rid="s9">9</xref>.</p>
<p>As mentioned above, detailed formulations of deep learning applied to computational mechanics in [<xref ref-type="bibr" rid="ref-38">38</xref>] [<xref ref-type="bibr" rid="ref-25">25</xref>] [<xref ref-type="bibr" rid="ref-26">26</xref>] [<xref ref-type="bibr" rid="ref-47">47</xref>] [<xref ref-type="bibr" rid="ref-48">48</xref>] are reviewed in Sections <xref ref-type="sec" rid="s10">10</xref>, <xref ref-type="sec" rid="s11">11</xref>, <xref ref-type="sec" rid="s12">12</xref>.</p>
<p><italic>History of AI, limitations, danger, and the classics</italic>. Finally, a broader historical perspective of deep learning, machine learning, and artificial intelligence is discussed in Section <xref ref-type="sec" rid="s13">13</xref>, ending with comments on the geopolitics, limitations, and (identified-and-proven, not just speculated) danger of artificial intelligence in Section <xref ref-type="sec" rid="s14">14</xref>.</p>
<p>A rare feature is a detailed review of some important classics to connect to the relevant concepts in modern literature, sometimes revealing misunderstanding in recent works, likely due to a lack of verification of the assertions made against the corresponding classics. For example, the first artificial neural network, conceived by Rosenblatt (1957) [<xref ref-type="bibr" rid="ref-1">1</xref>], (1962) [<xref ref-type="bibr" rid="ref-2">2</xref>], had 1000 neurons, but was reported as having a single neuron (Figure <xref ref-type="fig" rid="fig-42">42</xref>). Going beyond probabilistic analysis, Rosenblatt even built the Mark I computer to implement his 1000-neuron network (Figure <xref ref-type="fig" rid="fig-133">133</xref>, Sections <xref ref-type="sec" rid="s13_2">13.2</xref> and <xref ref-type="sec" rid="s13_2_1">13.2.1</xref>). Another example is the &#x201C;heavy ball&#x201D; method, for which everyone referred to Polyak (1964) [<xref ref-type="bibr" rid="ref-3">3</xref>], who more precisely called it the &#x201C;small heavy sphere&#x201D; method (Remark <xref ref-type="statement" rid="st6_6">6.6</xref>). Others were quick to dismiss classical deterministic line-search methods that have been generalized to add stochasticity for network training (Remark <xref ref-type="statement" rid="st6_4">6.4</xref>). Unintended misrepresentation of the classics would mislead first-time learners, and unfortunately even seasoned researchers who used second-hand information from others, without checking the original classics themselves.</p>
<p>The use of Volterra series to model the nonlinear behavior of a neuron in terms of input and output firing rates, leading to continuous recurrent neural networks, is examined in detail. The linear term of the Volterra series is a convolution integral that provides a theoretical foundation for the use of a linear combination of inputs to a neuron, with weights and biases [<xref ref-type="bibr" rid="ref-19">19</xref>]; see Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref>.</p>
<p>The experiments in the 1950s by Furshpan et al. [<xref ref-type="bibr" rid="ref-58">58</xref>] [<xref ref-type="bibr" rid="ref-59">59</xref>] that revealed the rectified linear behavior in the neuronal axon, modeled as a circuit with a diode, together with the use of the rectified linear activation function in neural networks in neuroscience years before being adopted for use in deep-learning networks, are reviewed in Section <xref ref-type="sec" rid="s13_3_2">13.3.2</xref>.</p>
<p><italic>Reference hypertext links and Internet archive</italic>. For the convenience of the readers, whenever we refer to an online article, we provide both the link to the original website, and if possible, also the link to its archived version in the Internet Archive. For example, we included in the bibliography entry of Ref. [<xref ref-type="bibr" rid="ref-60">60</xref>] the links to both the <ext-link ext-link-type="uri" xlink:href="http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414">Original website</ext-link> and the <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20181110195900/http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414">Internet archive</ext-link>.<xref ref-type="fn" rid="fn10"><sup>10</sup></xref><fn id="fn10"><label>10</label><p>While in the long run an original website may be moved or even deleted, the same website captured on the Internet Archive (also known as Web Archive or Wayback Machine) remains there permanently.</p></fn></p>
</sec>
<sec id="s2"><label>2</label>
<title>Deep Learning, resurgence of Artificial Intelligence</title>
<p>In Dec 2021, the journal <italic>Science</italic> named, as its &#x201C;2021 Breakthrough of the Year,&#x201D; the development of the AI software AlphaFold and its amazing feat of predicting a large number of protein structures [<xref ref-type="bibr" rid="ref-64">64</xref>]. &#x201C;For nearly 50 years, scientists have struggled to solve one of nature&#x2019;s most perplexing challenges&#x2014;predicting the complex 3D shape a string of amino acids will twist and fold into as it becomes a fully functional protein. This year, scientists have shown that artificial intelligence (AI)-driven software can achieve this long-standing goal and predict accurate protein structures by the thousands and at a fraction of the time and cost involved with previous methods&#x201D; [<xref ref-type="bibr" rid="ref-64">64</xref>].</p>

<fig id="fig-3">
<label>Figure 3</label>
<caption><title><italic>ImageNet competitions</italic> (Section <xref ref-type="sec" rid="s2">2</xref>). Top (smallest) classification error rate versus competition year. A sharp decrease in error rate in 2012 sparked a resurgence in AI interest and research [<xref ref-type="bibr" rid="ref-13">13</xref>]. By 2015, the top classification error rate fell below the human classification error rate of 5.1% with the Parametric Rectified Linear Unit [<xref ref-type="bibr" rid="ref-61">61</xref>]; see Section <xref ref-type="sec" rid="s5_3_3">5.3.3</xref> and also [<xref ref-type="bibr" rid="ref-62">62</xref>]. Figure from [<xref ref-type="bibr" rid="ref-63">63</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-3.tif"/>
</fig>
<p>The 3-D shape of a protein, obtained by folding a linear chain of amino acids, determines how this protein would interact with other molecules, and thus establishes its biological functions [<xref ref-type="bibr" rid="ref-64">64</xref>]. There are some 200 million proteins, the building blocks of life, in all living creatures, and 400,000 in the human body [<xref ref-type="bibr" rid="ref-64">64</xref>]. The <ext-link ext-link-type="uri" xlink:href="https://alphafold.ebi.ac.uk/">AlphaFold Protein Structure Database</ext-link> already contained &#x201C;over 200 million protein structure predictions.&#x201D;<xref ref-type="fn" rid="fn11"><sup>11</sup></xref><fn id="fn11"><label>11</label><p>See also AlphaFold Protein Structure Database <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220902132758/https://alphafold.ebi.ac.uk/">Internet archived as of 2022.09.02</ext-link>.</p></fn> For comparison, there were only about 190 thousand protein structures obtained through experiments as of 2022.07.28 [<xref ref-type="bibr" rid="ref-65">65</xref>]. &#x201C;Some of AlphaFold&#x2019;s predictions were on par with very good experimental models [Figure <xref ref-type="fig" rid="fig-2">2</xref>, left], and potentially precise enough to detail atomic features useful for drug design, such as the active site of an enzyme&#x201D; [<xref ref-type="bibr" rid="ref-66">66</xref>]. The influence of this software and its developers &#x201C;would be epochal.&#x201D;</p>
<p>On New Year&#x2019;s Day 2019, <italic>The Guardian</italic> [<xref ref-type="bibr" rid="ref-67">67</xref>] reported the most recent breakthrough in AI, published less than a month earlier, on 2018 Dec 07, in the journal <italic>Science</italic> [<xref ref-type="bibr" rid="ref-68">68</xref>]: the development of the software AlphaZero, based on deep reinforcement learning (a combination of deep learning and reinforcement learning), that can teach itself through self-play, and then &#x201C;convincingly defeated a world champion program in the games of chess, shogi (Japanese chess), as well as Go&#x201D;; see Figure <xref ref-type="fig" rid="fig-2">2</xref>, right.</p>
<p>Go is the most complex game that mankind ever created, with more combinations of possible moves than chess, and thus more than the number of atoms in the observable universe.<xref ref-type="fn" rid="fn12"><sup>12</sup></xref><fn id="fn12"><label>12</label><p>The number of atoms in the observable universe is estimated at <inline-formula id="ieqn-3000"><mml:math id="mml-ieqn-3000"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>80</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. For a board game such as chess and Go, the number of possible sequences of moves is <inline-formula id="ieqn-3001"><mml:math id="mml-ieqn-3001"><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mi>d</mml:mi></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-3002"><mml:math id="mml-ieqn-3002"><mml:mi>b</mml:mi></mml:math></inline-formula> being the game breadth (or &#x201C;branching factor&#x201D;, which is the &#x201C;number of legal moves per position&#x201D; or average number of moves at each turn), and <inline-formula id="ieqn-3003"><mml:math id="mml-ieqn-3003"><mml:mi>d</mml:mi></mml:math></inline-formula> the game depth (or length, also known as number of &#x201C;plies&#x201D;).
For chess, <inline-formula id="ieqn-3004"><mml:math id="mml-ieqn-3004"><mml:mi>b</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mn>35</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mn>80</mml:mn></mml:math></inline-formula>, and <inline-formula id="ieqn-3005"><mml:math id="mml-ieqn-3005"><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn>35</mml:mn><mml:mrow><mml:mn>80</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2248;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>123</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, whereas for Go, <inline-formula id="ieqn-3006"><mml:math id="mml-ieqn-3006"><mml:mi>b</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mn>250</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mn>150</mml:mn></mml:math></inline-formula>, and <inline-formula id="ieqn-3007"><mml:math id="mml-ieqn-3007"><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn>250</mml:mn><mml:mrow><mml:mn>150</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2248;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>360</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. See, e.g., &#x201C;Go and mathematics&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Go_and_mathematics&amp;oldid=845635147">version 03:40, 13 June 2018</ext-link>; &#x201C;Game-tree complexity&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Game_complexity&amp;oldid=863187839#Game-tree_complexity">version 07:04, 9 October 2018</ext-link>; [<xref ref-type="bibr" rid="ref-6">6</xref>].</p></fn> It is &#x201C;the most challenging of classic games for artificial intelligence [AI] owing to its enormous search space and the difficulty of evaluating board positions and moves&#x201D; [<xref ref-type="bibr" rid="ref-6">6</xref>].</p>
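<p>The order-of-magnitude estimates in the footnote on game-tree size can be checked with a few lines of arithmetic. The following is only a minimal sketch in Python, not taken from the cited works: the function name <monospace>log10_game_tree_size</monospace> and the constant <monospace>LOG10_ATOMS</monospace> are hypothetical names introduced here for illustration, and the breadth and depth values are the approximate ones quoted in the footnote. Logarithms are used so that the very large numbers never have to be formed explicitly.</p>
<code language="python"><![CDATA[
```python
import math


def log10_game_tree_size(breadth: float, depth: int) -> float:
    """Return log10(m) for m = breadth**depth, the estimated number of move sequences."""
    return depth * math.log10(breadth)


# Approximate values quoted in the footnote (assumed here, not re-derived).
LOG10_ATOMS = 80.0                                # about 10**80 atoms in the observable universe
log10_chess = log10_game_tree_size(35, 80)        # chess: b ~ 35, d ~ 80
log10_go = log10_game_tree_size(250, 150)         # Go: b ~ 250, d ~ 150

print(f"chess: m ~ 10^{log10_chess:.1f}")
print(f"Go:    m ~ 10^{log10_go:.1f}")
print("Go > chess > atoms:", log10_go > log10_chess > LOG10_ATOMS)
```
]]></code>
<p>With these assumed values, the exponent is about 123.5 for chess and about 359.7 for Go, consistent with the rounded figures in the footnote, and both exceed the exponent of 80 estimated for the number of atoms in the observable universe.</p>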
<p>This breakthrough is the crowning achievement in a string of astounding successes of deep learning (and reinforcement learning) in taking on this difficult challenge for AI.<xref ref-type="fn" rid="fn13"><sup>13</sup></xref><fn id="fn13"><label>13</label><p>See [<xref ref-type="bibr" rid="ref-69">69</xref>] [<xref ref-type="bibr" rid="ref-6">6</xref>] [<xref ref-type="bibr" rid="ref-70">70</xref>] [<xref ref-type="bibr" rid="ref-71">71</xref>]. See also the film <ext-link ext-link-type="uri" xlink:href="https://www.alphagomovie.com/"><italic>AlphaGo</italic></ext-link> (2017), &#x201C;an excellent and surprisingly touching documentary about one of the great recent triumphs of artificial intelligence, Google DeepMind&#x2019;s victory over the champion Go player Lee Sedol&#x201D; [<xref ref-type="bibr" rid="ref-72">72</xref>], and &#x201C;AlphaGo versus Lee Sedol,&#x201D; Wikipedia <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=AlphaGo_versus_Lee_Sedol&amp;oldid=1108283463">version 14:59, 3 September 2022</ext-link>.</p></fn> The success of this recent breakthrough prompted an AI expert to declare closed the multidecade-long, arduous chapter of AI research to conquer immensely complex games such as chess, shogi, and Go, and to suggest that AI researchers consider a new generation of games to provide the next set of challenges [<xref ref-type="bibr" rid="ref-73">73</xref>].</p>
<p>In its long history, AI research went through several cycles of ups and downs, in and out of fashion, as described in [<xref ref-type="bibr" rid="ref-74">74</xref>], &#x2018;Why artificial intelligence is enjoying a renaissance&#x2019; (see also Section <xref ref-type="sec" rid="s13">13</xref> on historical perspective):</p>
<disp-quote><p>&#x201C;THE TERM &#x201C;artificial intelligence&#x201D; has been associated with hubris and disappointment since its earliest days. It was coined in a research proposal from 1956, which imagined that significant progress could be made in getting machines to &#x201C;solve kinds of problems now reserved for humans if a carefully selected group of scientists work on it together for a summer&#x201D;. That proved to be rather optimistic, to say the least, and despite occasional bursts of progress and enthusiasm in the decades that followed, AI research became notorious for promising much more than it could deliver. Researchers mostly ended up avoiding the term altogether, preferring to talk instead about &#x201C;expert systems&#x201D; or &#x201C;neural networks&#x201D;. But in the past couple of years there has been a dramatic turnaround. Suddenly AI systems are achieving impressive results in a range of tasks, and people are once again using the term without embarrassment.&#x201D;</p>
</disp-quote>
<p>The recent resurgence of enthusiasm for AI research and applications dates back only to 2012, with a spectacular success of almost halving the error rate in image classification in the ImageNet competition,<xref ref-type="fn" rid="fn14"><sup>14</sup></xref><fn id="fn14"><label>14</label><p>&#x201C;ImageNet is an online database of millions of images, all labelled by hand. For any given word, such as &#x201C;balloon&#x201D; or &#x201C;strawberry&#x201D;, ImageNet contains several hundred images. The annual ImageNet contest encourages those in the field to compete and measure their progress in getting computers to recognise and label images automatically&#x201D; [<xref ref-type="bibr" rid="ref-75">75</xref>]. See also [<xref ref-type="bibr" rid="ref-62">62</xref>] and [<xref ref-type="bibr" rid="ref-60">60</xref>], for a history of the development of ImageNet, which played a critical role in the resurgence of interest and research in AI by paving the way for the mentioned 2012 spectacular success in reducing the error rate in image recognition.</p></fn> going from 26% down to 16%; Figure <xref ref-type="fig" rid="fig-3">3</xref> [<xref ref-type="bibr" rid="ref-63">63</xref>]. In 2015, the deep-learning error rate of 3.6% was smaller than the human-level error rate of 5.1%,<xref ref-type="fn" rid="fn15"><sup>15</sup></xref><fn id="fn15"><label>15</label><p>For a report on the human image classification error rate of 5.1%, see [<xref ref-type="bibr" rid="ref-76">76</xref>] and [<xref ref-type="bibr" rid="ref-62">62</xref>], Table 10.</p></fn> and then decreased by more than half to 2.3% by 2017.</p>
<p>The 2012 success<xref ref-type="fn" rid="fn16"><sup>16</sup></xref><fn id="fn16"><label>16</label><p>Actually, the first success of deep learning occurred three years earlier in 2009 in speech recognition; see Section <xref ref-type="sec" rid="s2">2</xref> regarding the historical perspective on the resurgence of AI.</p></fn> of a deep-learning application, which brought renewed interest in AI research out of its recurrent doldrums known as &#x201C;AI winters&#x201D;,<xref ref-type="fn" rid="fn17"><sup>17</sup></xref><fn id="fn17"><label>17</label><p>See [<xref ref-type="bibr" rid="ref-74">74</xref>].</p></fn> is due to the following reasons:</p>
<list list-type="bullet">
<list-item><p>Availability of much larger datasets for training deep neural networks (find optimized parameters). It is possible to say that without ImageNet, there would be no spectacular success in 2012, and thus no resurgence of AI. Once the importance of having large datasets to develop versatile, working deep networks was realized, many more large datasets have been developed. See, e.g., [<xref ref-type="bibr" rid="ref-60">60</xref>].</p></list-item>
<list-item><p>Emergence of more powerful computers than in the 1990s, e.g., the graphics processing unit (or GPU), &#x201C;which packs thousands of relatively simple processing cores on a single chip&#x201D; for use to process and display complex imagery, and to provide fast actions in today&#x2019;s video games [<xref ref-type="bibr" rid="ref-77">77</xref>].</p></list-item>
<list-item><p>Advanced software infrastructure (libraries) that facilitates faster development of deep-learning applications, e.g., TensorFlow, PyTorch, Keras, MXNet, etc. [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 25. See Section <xref ref-type="sec" rid="s9">9</xref> on some reviews and rankings of deep-learning libraries.</p></list-item>
<list-item><p>Larger neural networks and better training techniques (i.e., optimizing network parameters) that were not available in the 1980s. Today&#x2019;s much larger networks, which can solve once intractable / difficult problems, are &#x201C;one of the most important trends in the history of deep learning&#x201D;, but are still much smaller than the nervous system of a frog [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 21; see also Section <xref ref-type="sec" rid="s4_6">4.6</xref>. A 2006 breakthrough, ushering in the dawn of a new wave of AI research and interest, has allowed for efficient training of deeper neural networks [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 18.<xref ref-type="fn" rid="fn18"><sup>18</sup></xref><fn id="fn18"><label>18</label><p>The authors of [<xref ref-type="bibr" rid="ref-13">13</xref>] cited this 2006 breakthrough paper by Hinton, Osindero &amp; Teh in their reference no.32 with the mention &#x201C;This paper introduced a novel and effective way of training very deep neural networks by pre-training one hidden layer at a time using the unsupervised learning procedure for restricted Boltzmann machines (RBMs).&#x201D; A few years later, it was found that RBMs were not necessary to train deep networks, as it was sufficient to use rectified linear units (ReLUs) as activation functions ([<xref ref-type="bibr" rid="ref-79">79</xref>], interview with Y. Bengio); see also Section <xref ref-type="sec" rid="s4_4_2">4.4.2</xref> on activation functions. For this reason, we are not reviewing RBMs here.</p></fn> The training of large-scale deep neural networks, which frequently involves highly nonlinear and non-convex optimization problems with many local minima, owes its success to the use of the <italic>stochastic-gradient</italic> descent method first introduced in the 1950s [<xref ref-type="bibr" rid="ref-80">80</xref>].</p></list-item>
<list-item><p>Successful applications to difficult, complex problems that help people in their every-day lives, e.g., image recognition, speech translation, etc.</p>
<p><inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mo>&#x22C6;</mml:mo></mml:math></inline-formula> In medicine, AI &#x201C;is beginning to meet (and sometimes exceed) assessments by doctors in various clinical situations. A.I. can now diagnose skin cancer like dermatologists, seizures like neurologists, and diabetic retinopathy like ophthalmologists. Algorithms are being developed to predict which patients will get diarrhea or end up in the ICU,<xref ref-type="fn" rid="fn19"><sup>19</sup></xref><fn id="fn19"><label>19</label><p>Intensive Care Unit.</p></fn> and the FDA<xref ref-type="fn" rid="fn20"><sup>20</sup></xref><fn id="fn20"><label>20</label><p>Food and Drug Administration.</p></fn> recently approved the first machine learning algorithm to measure how much blood flows through the heart&#x2014;a tedious, time-consuming calculation traditionally done by cardiologists.&#x201D; Doctors lamented that they spent &#x201C;a decade in medical training learning the art of diagnosis and treatment,&#x201D; and were now easily surpassed by computers [<xref ref-type="bibr" rid="ref-81">81</xref>]. &#x201C;The use of artificial intelligence is proliferating in American health care&#x2014;outpacing the development of government regulation. From diagnosing patients to policing drug theft in hospitals, AI has crept into nearly every facet of the health-care system, eclipsing the use of machine intelligence in other industries&#x201D; [<xref ref-type="bibr" rid="ref-82">82</xref>].</p>
<p><inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:mo>&#x22C6;</mml:mo></mml:math></inline-formula> In micro-lending, AI has helped the Chinese company SmartFinance reduce the default rates of more than 2 million loans per month to low single digits, a track record that makes traditional brick-and-mortar banks extremely jealous&#x201D; [<xref ref-type="bibr" rid="ref-83">83</xref>].</p>
<p><inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:mo>&#x22C6;</mml:mo></mml:math></inline-formula> In the popular TED talk &#x201C;How AI can save humanity&#x201D; [<xref ref-type="bibr" rid="ref-84">84</xref>], the speaker alluded to the above-mentioned 2006 breakthrough ([<xref ref-type="bibr" rid="ref-78">78</xref>], p. 18) that marked the beginning of the &#x201C;deep learning&#x201D; wave of AI research when he said:<xref ref-type="fn" rid="fn21"><sup>21</sup></xref><fn id="fn21"><label>21</label><p>At video time 1:51. In less than a year, this 2018 April TED talk had more than two million views as of 2019 March.</p></fn> &#x201C;About 10 years ago, the grand AI discovery was made by three North American scientists,<xref ref-type="fn" rid="fn22"><sup>22</sup></xref><fn id="fn22"><label>22</label><p>See Footnote <xref ref-type="fn" rid="fn18">18</xref> for the names of these three scientists.</p></fn> and it&#x2019;s known as deep learning&#x201D;.</p></list-item></list>
<p>Section <xref ref-type="sec" rid="s13">13</xref> provides a historical perspective on the development of AI, with additional details on current and future applications.</p>
<p>It was, however, disappointing that despite the above-mentioned exciting outcomes of AI, during the Covid-19 pandemic beginning in 2020,<xref ref-type="fn" rid="fn23"><sup>23</sup></xref><fn id="fn23"><label>23</label><p>&#x201C;The World Health Organization declares COVID-19 a pandemic&#x201D; on 2020 Mar 11, <ext-link ext-link-type="uri" xlink:href="https://www.cdc.gov/museum/timeline/covid19.html">CDC Museum COVID-19 Timeline</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220602020133/https://www.cdc.gov/museum/timeline/covid19.html">Internet archive 2022.06.02</ext-link>.</p></fn> none of the hundreds of AI systems developed for Covid-19 diagnosis were usable for clinical applications; see Section <xref ref-type="sec" rid="s13_5_1">13.5.1</xref>. As of June 2022, the Tesla electric vehicle autopilot system is under increased scrutiny by the National Highway Traffic Safety Administration as there were &#x201C;16 crashes into emergency vehicles and trucks with warning signs, causing 15 injuries and one death.&#x201D;<xref ref-type="fn" rid="fn24"><sup>24</sup></xref><fn id="fn24"><label>24</label><p>Krisher T., <ext-link ext-link-type="uri" xlink:href="https://apnews.com/article/technology-health-business-8fc617fc492847d15bf253558ed1f925">Teslas with Autopilot a step closer to recall after wrecks</ext-link>, Associated Press, 2022.06.10.</p></fn> In addition, there are many limitations and dangers in the current state-of-the-art of AI; see Section <xref ref-type="sec" rid="s14">14</xref>.</p>
<sec id="s2_1"><label>2.1</label>
<title>Handwritten equation to LaTeX code, image recognition</title>
<p>An image-recognition tool useful for computational mechanicists is <ext-link ext-link-type="uri" xlink:href="https://mathpix.com/">Mathpix Snip</ext-link>,<xref ref-type="fn" rid="fn25"><sup>25</sup></xref><fn id="fn25"><label>25</label><p>We thank Kerem Uguz for informing the senior author LVQ about Mathpix.</p></fn> which recognizes hand-written math equations and transforms them into LaTeX code. For example, <ext-link ext-link-type="uri" xlink:href="https://mathpix.com/">Mathpix Snip</ext-link> transforms the hand-written equation below by an 11-year-old pupil:</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption><title><italic>Handwritten equation 1</italic> (Section <xref ref-type="sec" rid="s2_1">2.1</xref>)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-4.tif"/>
</fig>
<p>into this LaTeX code &#x201C;<monospace>p \times q = m \Rightarrow p = \frac { m } { q }</monospace>&#x201D; to yield the equation image: 
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mi>p</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mi>m</mml:mi><mml:mi>q</mml:mi></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Another example is the hand-written multiplication work below by the same pupil:</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption><title><italic>Handwritten equation 2</italic> (Section <xref ref-type="sec" rid="s2_1">2.1</xref>). Hand-written multiplication work of an eleven-year-old pupil.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-5.tif"/>
</fig>
<p>that <ext-link ext-link-type="uri" xlink:href="https://mathpix.com/">Mathpix Snip</ext-link> transformed into the equation image below:<xref ref-type="fn" rid="fn26"><sup>26</sup></xref><fn id="fn26"><label>26</label><p><ext-link ext-link-type="uri" xlink:href="https://mathpix.com/">Mathpix Snip</ext-link> &#x201C;misunderstood&#x201D; that the top horizontal line was part of a fraction, and upon correction of this &#x201C;misunderstanding&#x201D; and font-size adjustment yielded the equation image shown in Eq. (<xref ref-type="disp-formula" rid="eqn-2">2</xref>).</p></fn>
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:mtable columnalign="right" rowspacing="4pt" columnspacing="1em" rowlines="none solid none solid"><mml:mtr><mml:mtd><mml:mrow><mml:mn>97</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>66</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mn>1582</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mo>+</mml:mo><mml:mn>5820</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>6402</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
</p>
</sec>
<sec id="s2_2"><label>2.2</label>
<title>Artificial intelligence, machine learning, deep learning</title>
<p>We want to immediately clarify the meaning of the terminologies &#x201C;Artificial Intelligence&#x201D; (AI), &#x201C;Machine Learning&#x201D; (ML), and &#x201C;Deep Learning&#x201D; (DL), since their casual use could be confusing for first-time learners.</p>
<p>For example, it was stated in a review of primarily two computer-science topics called &#x201C;Neural Networks&#x201D; (NNs) and &#x201C;Support Vector Machines&#x201D; (SVMs) and a physics topic that [<xref ref-type="bibr" rid="ref-85">85</xref>]:<xref ref-type="fn" rid="fn27"><sup>27</sup></xref><fn id="fn27"><label>27</label><p>We are only concerned with NNs, not SVMs, in the present paper.</p></fn></p>
<disp-quote><p>&#x201C;The respective underlying fields of basic research&#x2014;quantum information versus machine learning (ML) and artificial intelligence (AI)&#x2014;have their own specific questions and challenges, which have hitherto been investigated largely independently.&#x201D;</p>
</disp-quote><p>Questions would immediately arise in the mind of first-time learners: Are ML and AI two different fields, or the same fields with different names? If one field is a subset of the other, then would it be more general to just refer to the larger set? On the other hand, would it be more specific to just refer to the subset?</p>
<p>In fact, Deep Learning is a subset of methods inside a larger set of methods known as Machine Learning, which in itself is a subset of methods generally known as Artificial Intelligence. In other words, Deep Learning is Machine Learning, which is Artificial Intelligence; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 9.<xref ref-type="fn" rid="fn28"><sup>28</sup></xref><fn id="fn28"><label>28</label><p>References to books are accompanied with page numbers for specific information cited here so readers don&#x2019;t waste time to wade through an 800-page book to look for such information.</p></fn> On the other hand, Artificial Intelligence is not necessarily Machine Learning, which in itself is not necessarily Deep Learning.</p>
<p>The review in [<xref ref-type="bibr" rid="ref-85">85</xref>] was restricted to Neural Networks (which could be deep or shallow)<xref ref-type="fn" rid="fn29"><sup>29</sup></xref><fn id="fn29"><label>29</label><p>Network depth and size are discussed in Section <xref ref-type="sec" rid="s4_6_1">4.6.1</xref>. An example of a shallow network with one hidden layer can be found in Section <xref ref-type="sec" rid="s12_4">12.4</xref> on nonlinear-manifold model-order reduction applied to fluid mechanics.</p></fn> and Support Vector Machine (which is Machine Learning, but not Deep Learning); see Figure <xref ref-type="fig" rid="fig-6">6</xref>. Deep Learning can be thought of as multiple levels of composition, going from simpler (less abstract) concepts (or representations) to more complex (abstract) concepts (or representations).<xref ref-type="fn" rid="fn30"><sup>30</sup></xref><fn id="fn30"><label>30</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 5, p. 8, p. 14.</p></fn></p>
<p>Based on the above relationship between AI, ML, and DL, it would be much clearer if the phrase &#x201C;machine learning (ML) and artificial intelligence (AI)&#x201D; in both the title of [<xref ref-type="bibr" rid="ref-85">85</xref>] and the original sentence quoted above is replaced by the phrase &#x201C;machine learning (ML)&#x201D; to be more specific, since the authors mainly reviewed Multi-Layer Neural (MLN) networks (deep learning, and thus machine learning) and Support Vector Machine (machine learning).<xref ref-type="fn" rid="fn31"><sup>31</sup></xref><fn id="fn31"><label>31</label><p>For more on Support Vector Machine (SVM), see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 137. In the early 1990s, SVM displaced neural networks with backpropagation as a better method for the machine-learning community ([<xref ref-type="bibr" rid="ref-79">79</xref>], interview with G. Hinton). The resurgence of AI due to advances in deep learning started with the seminal paper [<xref ref-type="bibr" rid="ref-86">86</xref>], in which the authors demonstrated via numerical experiments that MLN network was better than SVM in terms of error in the handwriting-recognition benchmark test using the <ext-link ext-link-type="uri" xlink:href="http://yann.lecun.com/exdb/mnist/">MNIST handwritten digit database</ext-link>, which contains &#x201C;a training set of 60,000 examples, and a test set of 10,000 examples.&#x201D; But kernel methods studied for the development of SVM have now been used in connection with networks with infinite width to understand how deep learning works; see Section <xref ref-type="sec" rid="s8">8</xref> on &#x201C;Kernel machines&#x201D; and Section <xref ref-type="sec" rid="s14_2">14.2</xref> on &#x201C;Lack of understanding.&#x201D;</p></fn> MultiLayer Neural (MLN) network is also known as MultiLayer Perceptron (MLP).<xref ref-type="fn" rid="fn32"><sup>32</sup></xref><fn id="fn32"><label>32</label><p>See [<xref ref-type="bibr" 
rid="ref-78">78</xref>], p. 5.</p></fn> Both MLN networks and SVMs are considered artificial intelligence, which in itself is too broad and thus not specific enough.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption><title><italic>Artificial intelligence and subfields</italic> (Section <xref ref-type="sec" rid="s2_2">2.2</xref>). Three classes of methods&#x2014;<italic>Artificial Intelligence</italic> (AI), <italic>Machine Learning</italic> (ML), and <italic>Deep Learning</italic> (DL)&#x2014;and their relationship, with an example method in each class. A knowledge-base method is an AI method, but is neither an ML method nor a DL method. Support Vector Machine and spiking computing are ML methods, and thus AI methods, but not DL methods. Multi-Layer Neural (MLN) network is a DL method, and is thus both an ML method and an AI method. See also Figure <xref ref-type="fig" rid="fig-158">158</xref> in Appendix <xref ref-type="sec" rid="s18">4</xref> on <italic>Cybernetics</italic>, which encompassed all of the above three classes.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-6.tif"/>
</fig>
<p>Another reason for simplifying the title in [<xref ref-type="bibr" rid="ref-85">85</xref>] is that the authors did not consider using any other AI methods, except for two specific ML methods, even though they discussed AI in the general historical context.</p>
<p>The engine of neuromorphic computing, also known as spiking computing, is a hardware network built into the IBM TrueNorth chip, which contains &#x201C;1 million programmable spiking neurons and 256 million configurable synapses&#x201D;,<xref ref-type="fn" rid="fn33"><sup>33</sup></xref><fn id="fn33"><label>33</label><p>The neurons are the computing units, and the synapses are the memory. Instead of grouping the computing units into a central processing unit (CPU) that is separated from the memory and connected to it via a bus, which creates a communication bottleneck, each neuron in the TrueNorth chip has its own synapses (local memory), as in the brain.</p></fn> and consumes &#x201C;extremely low power&#x201D; [<xref ref-type="bibr" rid="ref-87">87</xref>]. Despite the apparent difference with the software approach of deep computing, a neuromorphic chip could implement deep-learning networks, and thus the difference was not fundamental [<xref ref-type="bibr" rid="ref-88">88</xref>]. There is thus an overlap between neuromorphic computing and deep learning, as shown in Figure <xref ref-type="fig" rid="fig-6">6</xref>, instead of two disconnected subfields of machine learning as reported in [<xref ref-type="bibr" rid="ref-20">20</xref>].<xref ref-type="fn" rid="fn34"><sup>34</sup></xref><fn id="fn34"><label>34</label><p>In [<xref ref-type="bibr" rid="ref-20">20</xref>], there was only a reference to [<xref ref-type="bibr" rid="ref-87">87</xref>], but not to [<xref ref-type="bibr" rid="ref-88">88</xref>]. It is likely that the authors of [<xref ref-type="bibr" rid="ref-20">20</xref>] were not aware of [<xref ref-type="bibr" rid="ref-88">88</xref>], and thus of an intersection between neuromorphic computing and deep learning.</p></fn></p>
</sec>
<sec id="s2_3"><label>2.3</label>
<title>Motivation, applications to mechanics</title>
<p>As motivation, we present in this section the results in three recent papers in computational mechanics, mentioned in the Opening Remarks in Section <xref ref-type="sec" rid="s1">1</xref>, and identify some deep-learning fundamental concepts (in <italic>italics</italic>) employed in these papers, together with the corresponding sections in the present paper where these concepts are explained in detail. First-time learners of deep learning will likely find these fundamental concepts described by obscure technical jargon, whose meaning will be explained in detail in the identified subsequent sections. Experts in deep learning will see how deep learning is applied to computational mechanics.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption><title><italic>Feedforward neural network</italic> (Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>). A feedforward neural network in [<xref ref-type="bibr" rid="ref-38">38</xref>], rotated clockwise by 90 degrees to compare to its equivalent in Figure <xref ref-type="fig" rid="fig-23">23</xref> and Figure <xref ref-type="fig" rid="fig-35">35</xref> further below. All terminologies and fundamental concepts will be explained in detail in subsequent sections as <xref ref-type="fig" rid="fig-10">listed</xref>. See Section <xref ref-type="sec" rid="s4_1_1">4.1.1</xref> for a top-down explanation and Section <xref ref-type="sec" rid="s4_1_2">4.1.2</xref> for bottom-up explanation. This figure of a network could be confusing to first-time learners, as already indicated in Footnote <xref ref-type="fn" rid="fn5">5</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-7.tif"/>
</fig>
<sec id="s2_3_1"><label>2.3.1</label>
<title>Enhanced numerical quadrature for finite elements</title>
<p>To integrate efficiently and accurately the element matrices in a general finite element mesh of 3-D hexahedral elements (including distorted elements), the power of Deep Learning was harnessed in two applications of <italic>feedforward MultiLayer Neural networks</italic> (MLN,<xref ref-type="fn" rid="fn35"><sup>35</sup></xref><fn id="fn35"><label>35</label><p>MLN is also called MultiLayer Perceptron (MLP); see Footnote <xref ref-type="fn" rid="fn32">32</xref>.</p></fn> Figures <xref ref-type="fig" rid="fig-7">7</xref>-<xref ref-type="fig" rid="fig-8">8</xref>, Section <xref ref-type="sec" rid="s4">4</xref>) [<xref ref-type="bibr" rid="ref-38">38</xref>]:</p>
<list id="L1" list-type="simple">
<list-item><p>(1) Application 1.1: For each element (particularly distorted elements), find the number of integration points that provides accurate integration within a given error tolerance. Section <xref ref-type="sec" rid="s10_2">10.2</xref> contains the details.</p></list-item>
<list-item><p>(2) Application 1.2: Uniformly use <inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> integration points for all elements, distorted or not, and find the appropriate quadrature weights<xref ref-type="fn" rid="fn36"><sup>36</sup></xref><fn id="fn36"><label>36</label><p>The <italic>quadrature</italic> weights at integration points are not to be confused with the <italic>network</italic> weights in a MLN network.</p></fn> (<italic>different</italic> from the traditional quadrature weights of the Gauss-Legendre method) at these integration points. Section <xref ref-type="sec" rid="s10_3">10.3</xref> contains the details.</p></list-item></list>
<p>To <italic>train</italic><xref ref-type="fn" rid="fn37"><sup>37</sup></xref><fn id="fn37"><label>37</label><p>See Section <xref ref-type="sec" rid="s6">6</xref> on &#x201C;Network training, optimization methods&#x201D;.</p></fn> the networks&#x2014;i.e., to optimize the network parameters (weights and biases, Figure <xref ref-type="fig" rid="fig-8">8</xref>) to minimize some <italic>loss (cost, error) function</italic> (Sections <xref ref-type="sec" rid="s5_1">5.1</xref>, <xref ref-type="sec" rid="s6">6</xref>)&#x2014;up to 20,000 randomly distorted hexahedra were generated by displacing nodes from a regularly shaped element [<xref ref-type="bibr" rid="ref-38">38</xref>]; see Figure <xref ref-type="fig" rid="fig-9">9</xref>. For each distorted shape, the following are determined: (<xref ref-type="list" rid="L1">1</xref>) the minimum number of integration points required to reach a prescribed accuracy, and (<xref ref-type="list" rid="L1">2</xref>) corrections to the quadrature weights by trying one million randomly generated sets of correction factors, among which the best one was retained.</p>
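<p>The generation of distorted training shapes described above can be sketched in a few lines of Python. This is only a minimal illustration and not the procedure of [<xref ref-type="bibr" rid="ref-38">38</xref>]: the reference cube <monospace>REGULAR_CUBE</monospace>, the function name <monospace>distorted_hexahedra</monospace>, the uniform distribution of nodal shifts, and the bound <monospace>max_shift</monospace> are all assumptions introduced here.</p>
<code language="python"><![CDATA[
```python
import numpy as np

# Corner nodes of a regular reference cube [-1, 1]^3 (8 nodes, 3 coordinates each).
REGULAR_CUBE = np.array(
    [[x, y, z] for z in (-1.0, 1.0) for y in (-1.0, 1.0) for x in (-1.0, 1.0)]
)


def distorted_hexahedra(n_elements, max_shift=0.3, seed=0):
    """Return an array of shape (n_elements, 8, 3) of randomly displaced cube nodes.

    max_shift is a hypothetical bound on each nodal displacement component;
    it is an assumption made for this sketch only.
    """
    rng = np.random.default_rng(seed)
    shifts = rng.uniform(-max_shift, max_shift, size=(n_elements, 8, 3))
    # Broadcasting adds a different random shift to every node of every element.
    return REGULAR_CUBE[None, :, :] + shifts


elements = distorted_hexahedra(20000)
print(elements.shape)
```
]]></code>
<p>Each row of <monospace>elements</monospace> holds the eight displaced corner nodes of one hexahedron; any normalization of these coordinates before they are fed to a network is not part of this sketch.</p>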
<fig id="fig-8">
<label>Figure 8</label>
<caption><title><italic>Artificial neuron</italic> (Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>). A neuron with its multiple inputs <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msubsup><mml:mi>O</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> (which are outputs from the previous layer <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mo stretchy="false">(</mml:mo><mml:mi>p</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and thus the variable name &#x201C;<inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>O</mml:mi></mml:math></inline-formula>&#x201D;), processing operations (multiply inputs with network weights <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula>, sum weighted inputs, add bias <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mi>j</mml:mi><mml:mi>p</mml:mi></mml:msubsup></mml:math></inline-formula>, activation function <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>f</mml:mi></mml:math></inline-formula>), and single output <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msubsup><mml:mi>O</mml:mi><mml:mi>j</mml:mi><mml:mi>p</mml:mi></mml:msubsup></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-38">38</xref>]. See the equivalent Figure <xref ref-type="fig" rid="fig-36">36</xref>, Section <xref ref-type="sec" rid="s4_4_3">4.4.3</xref> further below. All terminologies and fundamental concepts will be explained in detail in subsequent sections as <xref ref-type="fig" rid="fig-10">listed</xref>. 
See Section <xref ref-type="sec" rid="s4">4</xref> on feedforward networks, Section <xref ref-type="sec" rid="s4_1_1">4.1.1</xref> on top-down explanation and Section <xref ref-type="sec" rid="s4_1_2">4.1.2</xref> on bottom-up explanation. This figure of a neuron could be confusing to first-time learners, as already indicated in Footnote <xref ref-type="fn" rid="fn5">5</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-8.tif"/>
</fig><fig id="fig-9">
<label>Figure 9</label>
<caption><title><italic>Cube and distorted cube elements</italic> (Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>). Regular and distorted linear hexahedral elements [<xref ref-type="bibr" rid="ref-38">38</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-9.tif"/>
</fig>
<p>While Application 1.1 used one <italic>fully-connected</italic> (Section <xref ref-type="sec" rid="s4_6_1">4.6.1</xref>) feedforward neural network (Section <xref ref-type="sec" rid="s4">4</xref>), Application 1.2 relied on two neural networks: The first neural network was a classifier that took the element shape (18 normalized nodal coordinates) as input and estimated whether or not the numerical integration (quadrature) could be improved by adjusting the quadrature weights for the given element (one output), i.e., the network classifier only produced two outcomes, yes or no. If an error reduction was possible, a second neural network performed regression to predict the corrected quadrature weights (eight outputs for <inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> quadrature) from the input element shape (usually distorted).</p>
<p>To train the classifier network, 10,000 element shapes were selected from the prepared dataset of 20,000 hexahedra, which were divided into a <italic>training set</italic> and a <italic>validation set</italic> (Section <xref ref-type="sec" rid="s6_1">6.1</xref>) of 5000 elements each.<xref ref-type="fn" rid="fn38"><sup>38</sup></xref><fn id="fn38"><label>38</label><p>For the definition of training set and test set, see Section <xref ref-type="sec" rid="s6_1">6.1</xref>. Briefly, the training set is used to optimize the network parameters, while the test set is used to see how well the network with these optimized parameters can predict the targets of never-seen-before inputs.</p></fn></p>
<p>To train the second regression network, 10,000 element shapes were selected for which quadrature could be improved by adjusting the quadrature weights [<xref ref-type="bibr" rid="ref-38">38</xref>].</p>
<p>Again, the training set and the test set comprised 5000 elements each. The parameters of the neural networks (<italic>weights, biases</italic>, Figure <xref ref-type="fig" rid="fig-8">8</xref>, Section <xref ref-type="sec" rid="s4_4">4.4</xref>) were optimized (trained) using a <italic>gradient descent method</italic> (Section <xref ref-type="sec" rid="s6">6</xref>) that minimizes a <italic>loss function</italic> (Section <xref ref-type="sec" rid="s5_1">5.1</xref>), whose gradients with respect to the parameters are computed using <italic>backpropagation</italic> (Section <xref ref-type="sec" rid="s5">5</xref>).</p>
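<p>The interplay of loss function, backpropagation, and gradient descent mentioned above can be illustrated with a small self-contained sketch. Everything specific in it is an assumption made for illustration and is not taken from [<xref ref-type="bibr" rid="ref-38">38</xref>]: a single hidden layer, a mean-squared-error loss, full-batch updates, the learning rate <monospace>lr</monospace>, and the synthetic data.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))


def train(x, y, n_hidden=8, lr=0.1, n_steps=500, seed=0):
    """Full-batch gradient descent on a one-hidden-layer network (MSE loss)."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 1.0, size=(x.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(0.0, 1.0, size=(n_hidden, y.shape[1]))
    b2 = np.zeros(y.shape[1])
    losses = []
    for _ in range(n_steps):
        # Forward pass: hidden activations, then linear output.
        h = sigmoid(x @ w1 + b1)
        y_hat = h @ w2 + b2
        err = y_hat - y
        losses.append(0.5 * np.mean(np.sum(err**2, axis=1)))
        # Backward pass (backpropagation): chain rule from output to input.
        g_out = err / x.shape[0]
        g_w2 = h.T @ g_out
        g_b2 = g_out.sum(axis=0)
        g_h = g_out @ w2.T
        g_z = g_h * h * (1.0 - h)  # derivative of the sigmoid
        g_w1 = x.T @ g_z
        g_b1 = g_z.sum(axis=0)
        # Gradient-descent update of weights and biases.
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return losses


# Synthetic data, used only to exercise the training loop.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(200, 2))
y = x[:, :1] * x[:, 1:]
losses = train(x, y)
print(losses[0], losses[-1])
```
]]></code>
<p>The list <monospace>losses</monospace> records the loss before each update, so comparing its first and last entries shows whether the optimization made progress on the training data.</p>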

<fig id="fig-10">
<label>Figure 10</label>
<caption><title><italic>Effectiveness of quadrature weight prediction</italic> (Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>). Subfigure (a): Distribution of error-reduction ratio <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> for the 5,000 elements in the training set. The red bars (&#x201C;Optimized&#x201D;) are the error ratios obtained from the optimal weights (found by a large number of trials-and-errors) that were used to train the network. The blue bars (&#x201C;Estimated by Neuro&#x201D;) are the error ratios obtained from the trained neural network. <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&lt;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> indicates improved quadrature accuracy. As a result of using the optimal weights, there were no red bars with <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&gt;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. 
That there were very few blue bars with <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&gt;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> showed that the proposed method worked in reducing the integration error in more than 97% of the elements. Subfigure (b): Error ratios for the test set with 5000 elements [<xref ref-type="bibr" rid="ref-38">38</xref>]. More detailed explanation is provided in Section <xref ref-type="sec" rid="s10_3_3">10.3.3</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-10.tif"/>
</fig>
<p>The best results were obtained from a classifier with four <italic>hidden layers</italic> (Figure <xref ref-type="fig" rid="fig-7">7</xref>, Section <xref ref-type="sec" rid="s4_3">4.3</xref>, Remark <xref ref-type="statement" rid="st4_4">4.2</xref>) with 30 <italic>neurons</italic> (Figure <xref ref-type="fig" rid="fig-8">8</xref>, Figure <xref ref-type="fig" rid="fig-36">36</xref>, Section <xref ref-type="sec" rid="s4_4_3">4.4.3</xref>) each and a regression network that had a depth of five hidden layers, where each layer was 50 neurons wide, Figure <xref ref-type="fig" rid="fig-7">7</xref>. The results were obtained using the <italic>logistic sigmoid function</italic> (Figure <xref ref-type="fig" rid="fig-30">30</xref>) as the <italic>activation function</italic> (Section <xref ref-type="sec" rid="s4_4_2">4.4.2</xref>) because of the existing software, even though the <italic>rectified linear function</italic> (Figure <xref ref-type="fig" rid="fig-24">24</xref>) was more efficient and yielded comparable accuracy on a few test cases.<xref ref-type="fn" rid="fn39"><sup>39</sup></xref><fn id="fn39"><label>39</label><p>Information provided by author A. Oishi of [<xref ref-type="bibr" rid="ref-38">38</xref>] through a private communication to the authors on 2018 Nov 16.</p></fn></p>
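<p>The two architectures described above can be written down as a short forward-pass sketch. The layer sizes follow the text (18 inputs, four hidden layers of 30 neurons, one output for the classifier; five hidden layers of 50 neurons, eight outputs for the regression network). The rest is assumed here for illustration only: the regression network is taken to have the same 18 inputs, the classifier output is passed through a sigmoid, and the weights are random placeholders rather than trained values.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))


def init_network(layer_sizes, seed=0):
    """Random weights and zero biases for consecutive layer sizes (placeholders)."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def forward(params, x, output_activation=None):
    """Hidden layers: multiply by weights, add bias, apply sigmoid; linear last layer."""
    out = x
    for weights, bias in params[:-1]:
        out = sigmoid(out @ weights + bias)
    weights, bias = params[-1]
    out = out @ weights + bias
    return out if output_activation is None else output_activation(out)


# Classifier: 18 inputs, four hidden layers of 30 neurons, one output.
classifier = init_network([18, 30, 30, 30, 30, 1])
# Regression network: 18 inputs (assumed), five hidden layers of 50 neurons, 8 outputs.
regressor = init_network([18, 50, 50, 50, 50, 50, 8], seed=1)

shape = np.zeros((1, 18))  # placeholder for one normalized element shape
p_improvable = forward(classifier, shape, output_activation=sigmoid)
weights_2x2x2 = forward(regressor, shape)
print(p_improvable.shape, weights_2x2x2.shape)
```
]]></code>
<p>With trained parameters in place of the placeholders, <monospace>p_improvable</monospace> would be thresholded to the yes-or-no decision, and <monospace>weights_2x2x2</monospace> would hold the eight predicted quadrature weights.</p>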
<p>To quantify the effectiveness of the approach in [<xref ref-type="bibr" rid="ref-38">38</xref>], an error-reduction ratio was introduced, i.e., the quotient of the quadrature error with quadrature weights predicted by the neural network and the error obtained with the standard quadrature weights of Gauss-Legendre quadrature with <inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> integration points; see Eq. (<xref ref-type="disp-formula" rid="eqn-402">402</xref>) in Section <xref ref-type="sec" rid="s10_3">10.3</xref> with <inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>, where &#x201C;<inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula>&#x201D; stands for &#x201C;optimized&#x201D; (or &#x201C;predicted&#x201D;). When the error-reduction ratio is less than 1, the integration using the predicted quadrature weights is more accurate than that using the standard quadrature weights. To compute the two quadrature errors mentioned above (one for the predicted quadrature weights and one for the standard quadrature weights, both for the same <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> integration points), the reference values considered as most accurate were obtained using <inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:mn>30</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>30</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>30</mml:mn></mml:math></inline-formula> integration points with the standard quadrature weights; see Eq. 
(<xref ref-type="disp-formula" rid="eqn-401">401</xref>) in Section <xref ref-type="sec" rid="s10_2">10.2</xref> with <inline-formula id="ieqn-147"><mml:math id="mml-ieqn-147"><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>30</mml:mn></mml:math></inline-formula>.</p>
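<p>The idea behind the error-reduction ratio can be made concrete with a toy computation. The sketch below is not the element-matrix formulation of [<xref ref-type="bibr" rid="ref-38">38</xref>]: it integrates a scalar polynomial over the undistorted cube, and the array <monospace>factors</monospace> is a hypothetical stand-in for network-predicted weight corrections. The function names <monospace>gauss_cube</monospace> and <monospace>error_reduction_ratio</monospace> are likewise introduced here for illustration.</p>
<code language="python"><![CDATA[
```python
import numpy as np
from numpy.polynomial.legendre import leggauss


def gauss_cube(f, q, weight_factors=None):
    """Integrate f(x, y, z) over [-1, 1]^3 with q x q x q Gauss-Legendre points.

    weight_factors (shape (q, q, q)) optionally rescales the standard weights;
    it stands in for predicted corrections and is purely illustrative.
    """
    pts, wts = leggauss(q)
    x, y, z = np.meshgrid(pts, pts, pts, indexing="ij")
    w = wts[:, None, None] * wts[None, :, None] * wts[None, None, :]
    if weight_factors is not None:
        w = w * weight_factors
    return float(np.sum(w * f(x, y, z)))


def error_reduction_ratio(f, weight_factors, q=2, q_max=30):
    """Quadrature error with corrected weights divided by error with standard weights."""
    reference = gauss_cube(f, q_max)  # taken as the most accurate value
    err_standard = abs(gauss_cube(f, q) - reference)
    err_corrected = abs(gauss_cube(f, q, weight_factors) - reference)
    return err_corrected / err_standard


def f(x, y, z):
    # Degree-4 polynomial: not integrated exactly by 2 x 2 x 2 points.
    return x**4 + y**4 + z**4


factors = np.full((2, 2, 2), 1.5)  # hypothetical correction factors
ratio = error_reduction_ratio(f, factors)
print(ratio)
```
]]></code>
<p>For this integrand the reference value is 4.8, the standard 2&#x00D7;2&#x00D7;2 rule gives 8/3, and the rescaled rule gives 4, so the ratio is 0.375, i.e., below 1 and thus an improvement in the sense described above.</p>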
<p>For most element shapes of both the training set (a) and the test set (b), each of which comprised 5000 elements, the blue bars in Figure <xref ref-type="fig" rid="fig-10">10</xref> indicate an error ratio below one, i.e., the quadrature weight correction effectively improved the accuracy of numerical quadrature.</p>
<p>Readers familiar with Deep Learning and neural networks can go directly to Section <xref ref-type="sec" rid="s10">10</xref>, where the details of the formulations in [<xref ref-type="bibr" rid="ref-38">38</xref>] are presented. Other sections are also of interest, such as classic and state-of-the-art optimization methods in Section <xref ref-type="sec" rid="s6">6</xref>, attention and transformer unit in Section <xref ref-type="sec" rid="s7">7</xref>, historical perspective in Section <xref ref-type="sec" rid="s13">13</xref>, and limitations and dangers of AI in Section <xref ref-type="sec" rid="s14">14</xref>.</p>
<p>Readers not familiar with Deep Learning and neural networks will find below a list of the concepts that will be explained in subsequent sections. To facilitate the reading, we also provide the section number (and the link to jump to) for each concept.</p>
<p><bold>Deep-learning concepts to explain and explore:</bold></p>
<list list-type="simple">
<list-item><p>(1) Feedforward neural network (Figure <xref ref-type="fig" rid="fig-7">7</xref>): Figure <xref ref-type="fig" rid="fig-23">23</xref> and Figure <xref ref-type="fig" rid="fig-35">35</xref>, Section <xref ref-type="sec" rid="s4">4</xref></p>
</list-item>
<list-item><p>(2) Neuron (Figure <xref ref-type="fig" rid="fig-8">8</xref>): Figure <xref ref-type="fig" rid="fig-36">36</xref> in Section <xref ref-type="sec" rid="s4_4_4">4.4.4</xref> (artificial neuron), and Figure <xref ref-type="fig" rid="fig-131">131</xref> in Section <xref ref-type="sec" rid="s13_1">13.1</xref> (biological neuron)</p>
</list-item>
<list-item><p>(3) Inputs, output, hidden layers, Section <xref ref-type="sec" rid="s4_3">4.3</xref></p></list-item>
<list-item><p>(4) Network depth and width: Section <xref ref-type="sec" rid="s4_3">4.3</xref></p></list-item>
<list-item><p>(5) Parameters, weights, biases: Section <xref ref-type="sec" rid="s4_4_1">4.4.1</xref></p></list-item>
<list-item><p>(6) Activation functions: Section <xref ref-type="sec" rid="s4_4_2">4.4.2</xref></p></list-item>
<list-item><p>(7) What is &#x201C;deep&#x201D; in &#x201C;deep networks&#x201D;? Size, architecture: Section <xref ref-type="sec" rid="s4_6_1">4.6.1</xref>, Section <xref ref-type="sec" rid="s4_6_2">4.6.2</xref></p></list-item>
<list-item><p>(8) Backpropagation, computation of gradient: Section <xref ref-type="sec" rid="s5">5</xref></p></list-item>
<list-item><p>(9) Loss (cost, error) function, Section <xref ref-type="sec" rid="s5_1">5.1</xref></p></list-item>
<list-item><p>(10) Training, optimization, stochastic gradient descent: Section <xref ref-type="sec" rid="s6">6</xref></p></list-item>
<list-item><p>(11) Training error, validation error, test (or generalization) error: Section <xref ref-type="sec" rid="s6_1">6.1</xref></p></list-item></list>
<p>This list is continued further <xref ref-type="fig" rid="fig-17">below</xref> in Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>. Details of the formulation in [<xref ref-type="bibr" rid="ref-38">38</xref>] are discussed in Section <xref ref-type="sec" rid="s10">10</xref>.</p>
<fig id="fig-11">
<label>Figure 11</label>
<caption><title><italic>Dual-porosity single-permeability medium</italic> (Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>). <italic>Left</italic>: Actual reservoir. Dual (or double) porosity indicates the presence of two types of porosity in naturally-fractured reservoirs (e.g., of oil): (1) Primary porosity in the matrix (e.g., voids in sands) with low permeability, within which fluid does not flow, (2) Secondary porosity due to fractures and vugs (cavities in rocks) with high (anisotropic) permeability, within which fluid flows. Fluid exchange is permitted between the matrix and the fractures, but not between the matrix blocks (sugar cubes), of which the permeability is much smaller than in the fractures. <italic>Right</italic>: Model reservoir, idealization. The primary porosity is an array of cubes of homogeneous, isotropic material. The secondary porosity is an &#x201C;orthogonal system of continuous, uniform fractures&#x201D;, oriented along the principal axes of anisotropic permeability [<xref ref-type="bibr" rid="ref-89">89</xref>]. (Figure reproduced with permission of the publisher SPE.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-11.tif"/>
</fig>
</sec>
<sec id="s2_3_2"><label>2.3.2</label>
<title>Solid mechanics, multiscale modeling</title>
<p>One way to use deep learning in solid mechanics is to model the complex, nonlinear constitutive behavior of materials. In single physics, the balance of linear momentum and the strain-displacement relation are considered as definitions or &#x201C;universal principles&#x201D;, leaving the constitutive law, or stress-strain relation, to a large number of models that have limitations, no matter how advanced [<xref ref-type="bibr" rid="ref-91">91</xref>]. Deep learning can help model complex constitutive behaviors in ways that traditional phenomenological models could not; see Figure <xref ref-type="fig" rid="fig-105">105</xref>.</p>
<p>Deep <italic>recurrent neural networks</italic> (RNNs) (Section <xref ref-type="sec" rid="s7_1">7.1</xref>) were used as a scale-bridging method to efficiently simulate multiscale problems in hydromechanics, specifically plasticity in porous media with dual porosity <italic>and</italic> dual permeability [<xref ref-type="bibr" rid="ref-25">25</xref>].<xref ref-type="fn" rid="fn40"><sup>40</sup></xref><fn id="fn40"><label>40</label><p>Porosity is the ratio of void volume to total volume. Permeability is a scaling factor which, when multiplied by the negative of the pressure gradient and divided by the fluid dynamic viscosity, gives the fluid velocity in Darcy&#x2019;s law, Eq. (<xref ref-type="disp-formula" rid="eqn-409">409</xref>). The expression &#x201C;dual-porosity dual-permeability poromechanics problem&#x201D; used in [<xref ref-type="bibr" rid="ref-25">25</xref>], p. 340, could confuse first-time readers, especially those who are familiar with traditional reservoir simulation, e.g., in [<xref ref-type="bibr" rid="ref-92">92</xref>], since dual porosity (also called &#x201C;double porosity&#x201D; in [<xref ref-type="bibr" rid="ref-89">89</xref>]) and dual permeability are two different models of naturally-fractured porous media; these two models for radionuclide transport around a nuclear-waste repository were studied in [<xref ref-type="bibr" rid="ref-93">93</xref>]. Adding to the confusion, the dual-porosity model is more precisely called the <italic>dual-porosity single-permeability</italic> model, whereas the dual-permeability model is called the <italic>dual-porosity dual-permeability</italic> model [<xref ref-type="bibr" rid="ref-94">94</xref>], which has a different meaning than the one used in [<xref ref-type="bibr" rid="ref-25">25</xref>].</p></fn></p>
<p>The <italic>dual-porosity single-permeability</italic> (DPSP) model was first introduced for use in oil-reservoir simulation [<xref ref-type="bibr" rid="ref-89">89</xref>], Figure <xref ref-type="fig" rid="fig-11">11</xref>, where the fracture system was the main flow path for the fluid (e.g., two-phase oil-water mixture, one-phase oil-solvent mixture). Fluid exchange is permitted between the rock matrix and the fracture system, but not between the matrix blocks. In the DPSP model, the fracture system and the rock matrix each have their own porosity, with values not differing from each other by a large factor. In contrast, the permeability of the fracture system is much larger than that of the rock matrix, and thus the system is considered as having only a single permeability. When the permeability of the fracture system and that of the rock matrix do not differ by a large factor, both permeabilities are included in the more general <italic>dual-porosity dual-permeability</italic> (DPDP) model [<xref ref-type="bibr" rid="ref-94">94</xref>].</p>
<fig id="fig-12">
<label>Figure 12</label>
<caption><title><italic>Pore structure of Majella limestone, dual porosity</italic> (Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>), a carbonate rock with high total porosity at 30%. Backscattered SEM images of Majella limestone: (a)-(c) sequence of zoom-ins; (d) zoomed-out. (a) The larger macropores (dark areas) have dimensions comparable to the grains (allochems), having an average diameter of 54 <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula>m, with macroporosity at 11.4%. (b) Micropores embedded in the grains and cemented regions, with microporosity at 19.6%, which is approximately the total porosity (nominally 30%) minus the macroporosity. (c) Numerous micropores in the periphery of a macropore. (d) Map performed manually under optical microscope showing the partitioning of grains, matrix (mostly cement) and porosity [<xref ref-type="bibr" rid="ref-90">90</xref>]. See Section <xref ref-type="sec" rid="s11">11</xref> and Remark <xref ref-type="statement" rid="st11_8">11.8</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-12.tif"/>
</fig>
<p>Since 60% of the world&#x2019;s oil reserves and 40% of the world&#x2019;s gas reserves are held in carbonate rocks, there has been a clear interest in understanding the mechanical behavior of carbonate rocks such as limestones, ranging from the lowest porosity (Solenhofen at 3%) to high porosity (e.g., Majella at 30%). Chalk (Lixhe) is a carbonate rock with the highest porosity at 42.8%. Carbonate rock reservoirs are also being considered for storing carbon dioxide and nuclear waste [<xref ref-type="bibr" rid="ref-95">95</xref>] [<xref ref-type="bibr" rid="ref-93">93</xref>].</p>
<p>In oil-reservoir simulations in which the primary interest is the flow of oil, water, and solvent, the porosity (and pore size) within each domain (rock matrix or fracture system) is treated as constant and homogeneous [<xref ref-type="bibr" rid="ref-94">94</xref>] [<xref ref-type="bibr" rid="ref-96">96</xref>].<xref ref-type="fn" rid="fn41"><sup>41</sup></xref><fn id="fn41"><label>41</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-94">94</xref>], p. 295, Chap. 9 on &#x201C;Advanced Topics: Fluid Flow in Fractured Reservoirs and Compositional Simulation&#x201D;.</p></fn> On the other hand, under mechanical stress, the pore size would change, cracks and other defects would close, leading to a change in the porosity in carbonate rocks. Indeed, &#x201C;at small stresses, experimental mechanical deformation of carbonate rock is usually characterized by a non-linear stress-strain relationship, interpreted to be related to the closure of cracks, pores, and other defects. The non-linear stress-strain relationship can be related to the amount of cracks and various type of pores&#x201D; [<xref ref-type="bibr" rid="ref-95">95</xref>], p. 202. Once the pores and cracks are closed, the stress-strain relation becomes linear, at different stress stages, depending on the initial porosity and the geometry of the pore space [<xref ref-type="bibr" rid="ref-95">95</xref>].</p>
<fig id="fig-13">
<label>Figure 13</label>
<caption><title><italic>Majella limestone, nonlinear stress-strain relations</italic> (Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>). Differential stress (i.e., the difference between the largest principal stress and the smallest one) vs axial strain (left) and vs volumetric strain (right) [<xref ref-type="bibr" rid="ref-90">90</xref>]. See Remark <xref ref-type="statement" rid="st11_7">11.7</xref>, Section <xref ref-type="sec" rid="s11_3_4">11.3.4</xref>, and Remark <xref ref-type="statement" rid="st11_10">11.10</xref>, Section <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-13.tif"/>
</fig>
<p>Moreover, pores have different sizes, and can be classified into different pore sub-systems. For the Majella limestone in Figure <xref ref-type="fig" rid="fig-12">12</xref> with a total porosity of about 30%, the pore space can be partitioned into two sub-systems (hence dual porosity): the macropores with macroporosity at 11.4% and the micropores with microporosity at 19.6%. Thus the meaning of dual porosity as used in [<xref ref-type="bibr" rid="ref-25">25</xref>] is different from that in oil-reservoir simulation. Also characteristic of porous rocks such as the Majella limestone is the non-linear stress-strain relation observed in experiments, Figure <xref ref-type="fig" rid="fig-13">13</xref>, due to the changing size, and collapse, of the pores.</p>
<p>Likewise, the meaning of &#x201C;dual permeability&#x201D; is different in [<xref ref-type="bibr" rid="ref-25">25</xref>] in the sense that &#x201C;one does not seek to obtain a single effective permeability for the entire pore space&#x201D;. Even though it was not explicitly spelled out,<xref ref-type="fn" rid="fn42"><sup>42</sup></xref><fn id="fn42"><label>42</label><p>At least at the beginning of Section 2 in [<xref ref-type="bibr" rid="ref-25">25</xref>].</p></fn> it appears that each of the two pore sub-systems would have its own permeability, and that fluid exchange is allowed between the two pore sub-systems, similar to the fluid exchange between the rock matrix and the fracture system in the DPSP and DPDP models in oil-reservoir simulation [<xref ref-type="bibr" rid="ref-94">94</xref>].</p>
<p>In the problem investigated in [<xref ref-type="bibr" rid="ref-25">25</xref>], the presence of localized discontinuities demands that three scales be considered in the modeling: microscale (<inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula>m), mesoscale (cm), and macroscale (km); see Figure <xref ref-type="fig" rid="fig-14">14</xref>. Classical approaches to consistently integrate microstructural properties into macroscopic constitutive laws relied on hierarchical simulation models and homogenization methods (e.g., discrete element method (DEM)&#x2013;FEM coupling, FEM<inline-formula id="ieqn-149"><mml:math id="mml-ieqn-149"><mml:msup><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>). If more than two scales were to be considered, the computational complexity would become prohibitively, if not intractably, large.</p>
<p>Instead of coupling multiple simulation models online, two (adjacent) scales were linked by a neural network that was trained offline using data generated by simulations on the smaller scale [<xref ref-type="bibr" rid="ref-25">25</xref>]. The trained network subsequently served as a surrogate model in online simulations on the larger scale. With three scales being considered, two recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM) units were employed consecutively:
<list list-type="simple">
<list-item><p>(1) <italic>Mesoscale RNN with LSTM units:</italic> On the microscopic scale, a representative volume element (RVE) was an assembly of discrete-element particles, subjected to a large variety of representative loading paths to generate training data for the supervised learning of the mesoscale RNN with LSTM units, a neural network that was referred to as &#x201C;Mesoscale data-driven constitutive model&#x201D; [<xref ref-type="bibr" rid="ref-25">25</xref>] (Figure <xref ref-type="fig" rid="fig-14">14</xref>). Homogenizing the results of the DEM-flow model provided constitutive equations for the traction-separation law and the evolution of anisotropic permeabilities in damaged regions.</p></list-item>
<list-item><p>(2) <italic>Macroscale RNN with LSTM units:</italic> The mesoscale RVE (middle row in Figure <xref ref-type="fig" rid="fig-14">14</xref>), in turn, was a finite-element model of a porous material with embedded strong discontinuities equivalent to the fracture system in oil-reservoir simulation in Figure <xref ref-type="fig" rid="fig-11">11</xref>. The host matrix of the RVE was represented by an isotropic linearly elastic solid. In localized fracture zones within, the traction-separation law and the hydraulic response were provided by the mesoscale RNN with LSTM units developed above. Training data for the macroscale RNN with LSTM units, a network referred to as &#x201C;Macroscale data-driven constitutive model&#x201D; [<xref ref-type="bibr" rid="ref-25">25</xref>], were generated by computing the (homogenized) response of the mesoscale RVE to various loadings. In macroscopic simulations, the macroscale RNN with LSTM units provided the constitutive response at a sealing fault that represented a strong discontinuity.</p>
</list-item></list></p>
<fig id="fig-14">
<label>Figure 14</label>
<caption><title><italic>Hierarchy of a multi-scale multi-physics poromechanics problem for fluid-infiltrating media</italic> [<xref ref-type="bibr" rid="ref-25">25</xref>] (Sections <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, <xref ref-type="sec" rid="s11_1">11.1</xref>, <xref ref-type="sec" rid="s11_3_1">11.3.1</xref>, <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>, <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>). Microscale (<inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula>m), mesoscale (cm), macroscale (km). DEM RVE = Discrete Element Method Representative Volume Element. RNN-FEM = Recurrent Neural Network (Section <xref ref-type="sec" rid="s7_1">7.1</xref>) - Finite Element Method. LSTM = Long Short-Term Memory (Section <xref ref-type="sec" rid="s7_2">7.2</xref>). The mesoscale has embedded strong discontinuities equivalent to the fracture system in Figure <xref ref-type="fig" rid="fig-11">11</xref>. See Figure <xref ref-type="fig" rid="fig-104">104</xref>, where the orientations of the RVEs are shown, Figure <xref ref-type="fig" rid="fig-106">106</xref> for the microscale RVE (Remark <xref ref-type="statement" rid="st11_1">11.1</xref>) and Figure <xref ref-type="fig" rid="fig-113">113</xref> for the mesoscale RVE (Remark <xref ref-type="statement" rid="st11_9">11.9</xref>). (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-14.tif"/>
</fig>
<fig id="fig-15">
<label>Figure 15</label>
<caption><title><italic>LSTM variant with &#x201C;peephole&#x201D; connections</italic>, block diagram (Sections <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, <xref ref-type="sec" rid="s7_2">7.2</xref>).<xref ref-type="fn" rid="fn43"><sup>43</sup></xref><fn id="fn43"><label>43</label><p>The LSTM variant with peephole connections is not the original LSTM cell (Section <xref ref-type="sec" rid="s7_2">7.2</xref>); see, e.g., [<xref ref-type="bibr" rid="ref-97">97</xref>]. The equations describing the LSTM unit in [<xref ref-type="bibr" rid="ref-25">25</xref>], whose authors never mentioned the word &#x201C;peephole&#x201D;, correspond to the original LSTM without peepholes. It was likely a mistake to use this figure in [<xref ref-type="bibr" rid="ref-25">25</xref>].</p></fn> Unlike the original LSTM unit (see Section <xref ref-type="sec" rid="s7_2">7.2</xref>), both the input gate and the forget gate in an LSTM unit with peephole connections receive the cell state as input. The above figure from Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://commons.wikimedia.org/w/index.php?title=File:Long_Short_Term_Memory.png&amp;oldid=174444321">version 22:56, 4 October 2015</ext-link>, is identical to Figure <xref ref-type="fig" rid="fig-10">10</xref> in [<xref ref-type="bibr" rid="ref-25">25</xref>], whose authors erroneously used this figure without mentioning the source, but where the original LSTM unit <italic>without</italic> &#x201C;peepholes&#x201D; was actually used, and with the detailed block diagram in Figure <xref ref-type="fig" rid="fig-81">81</xref>, Section <xref ref-type="sec" rid="s7_2">7.2</xref>. See also Figure <xref ref-type="fig" rid="fig-82">82</xref> and Figure <xref ref-type="fig" rid="fig-117">117</xref> for the original LSTM unit applied to fluid mechanics. (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by-sa/4.0/deed.en">CC-BY-SA 4.0</ext-link>)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-15.tif"/>
</fig>
<p>Path-dependence is a common characteristic feature of the constitutive models that are often realized as neural networks; see, e.g., [<xref ref-type="bibr" rid="ref-23">23</xref>]. For this reason, it was decided to employ an RNN with LSTM units, which could mimic internal variables and corresponding evolution equations that were intrinsic to path-dependent material behavior [<xref ref-type="bibr" rid="ref-25">25</xref>]. These authors chose a neural network with a depth of two hidden layers and 80 LSTM units per layer, which had proved to be a good compromise between performance and training effort. After each hidden layer, a dropout layer with a dropout rate of 0.2 was introduced to reduce overfitting on noisy data, but yielded minor effects, as reported in [<xref ref-type="bibr" rid="ref-25">25</xref>]. The output layer was a fully-connected layer with a logistic sigmoid as activation function.</p>
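<p>As an illustration of the architecture just described, the following is a minimal NumPy sketch, not the implementation of [<xref ref-type="bibr" rid="ref-25">25</xref>]: two stacked LSTM layers of 80 units each (original LSTM without peepholes, Section <xref ref-type="sec" rid="s7_2">7.2</xref>), dropout with rate 0.2 after each hidden layer, and a fully-connected output layer with a logistic sigmoid. The input size (3), the output size (2), the random weight initialization, the use of inverted dropout, and all function names are assumptions made only for this example; the weights are untrained.</p>
<code language="python"><![CDATA[
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(n_in, n_hidden, rng):
    """Random weights for one LSTM layer; gates stacked as [input, forget, cell, output]."""
    scale = 1.0 / np.sqrt(n_hidden)
    return {
        "W": rng.uniform(-scale, scale, (4 * n_hidden, n_in)),      # input weights
        "U": rng.uniform(-scale, scale, (4 * n_hidden, n_hidden)),  # recurrent weights
        "b": np.zeros(4 * n_hidden),
    }

def lstm_layer(params, xs):
    """Run an LSTM layer (no peepholes) over a sequence xs of shape (T, n_in)."""
    n_hidden = params["U"].shape[1]
    h = np.zeros(n_hidden)  # hidden state
    c = np.zeros(n_hidden)  # cell state, playing the role of internal variables
    hs = []
    for x in xs:
        z = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        hs.append(h)
    return np.stack(hs)

def dropout(hs, rate, rng, training):
    """Inverted dropout (an assumption): active only while training."""
    if not training:
        return hs
    mask = rng.random(hs.shape) >= rate
    return hs * mask / (1.0 - rate)

def build_model(n_in, n_out, n_hidden=80, rate=0.2, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "lstm1": init_lstm(n_in, n_hidden, rng),
        "lstm2": init_lstm(n_hidden, n_hidden, rng),
        "W_out": rng.uniform(-0.1, 0.1, (n_out, n_hidden)),
        "b_out": np.zeros(n_out),
        "rate": rate,
    }

def forward(model, xs, rng, training=False):
    """Two hidden LSTM layers, each followed by dropout, then a sigmoid output layer."""
    h1 = dropout(lstm_layer(model["lstm1"], xs), model["rate"], rng, training)
    h2 = dropout(lstm_layer(model["lstm2"], h1), model["rate"], rng, training)
    return sigmoid(h2 @ model["W_out"].T + model["b_out"])

model = build_model(n_in=3, n_out=2)          # hypothetical input and output sizes
rng = np.random.default_rng(1)
loading_path = rng.normal(size=(50, 3))       # hypothetical loading history, 50 steps
prediction = forward(model, loading_path, rng, training=False)
print(prediction.shape)
```
]]></code>
<p>Because the sigmoid output lies in the interval (0, 1), target quantities would have to be scaled to that interval before training; how this scaling was done in [<xref ref-type="bibr" rid="ref-25">25</xref>] is not covered by this sketch.</p>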
<p>An important observation is that including micro-structural data as network inputs, namely the porosity <inline-formula id="ieqn-153"><mml:math id="mml-ieqn-153"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula>, the coordination number <inline-formula id="ieqn-154"><mml:math id="mml-ieqn-154"><mml:mi>C</mml:mi><mml:mi>N</mml:mi></mml:math></inline-formula> (number of contact points, Figure <xref ref-type="fig" rid="fig-16">16</xref>), and the fabric tensor (defined based on the normals at the contact points, Eq. (<xref ref-type="disp-formula" rid="eqn-404">404</xref>) in Section <xref ref-type="sec" rid="s11">11</xref>; Figure <xref ref-type="fig" rid="fig-16">16</xref> provides a visualization), significantly improved the prediction capability of the neural network. Such improvement is not surprising since soil fabric, described by scalars (porosity, coordination number, particle size) and directional quantities (fabric tensors, particle orientation, branch vectors), exerts great influence on soil behavior [<xref ref-type="bibr" rid="ref-99">99</xref>]. Coordination number<xref ref-type="fn" rid="fn44"><sup>44</sup></xref><fn id="fn44"><label>44</label><p>The <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Coordination_number&amp;oldid=970032729">coordination number</ext-link> (Wikipedia version 20:43, 28 July 2020) is a concept that originated in chemistry, signifying the number of bonds from the surrounding atoms to a central atom. 
In Figure <xref ref-type="fig" rid="fig-16">16</xref> (a), the <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Uranium_borohydride&amp;oldid=887379908">uranium borohydride</ext-link> U(BH<sub>4</sub>)<sub>4</sub> complex (Wikipedia version 08:38, 12 March 2019) has 12 hydrogen atoms bonded to the central uranium atom.</p></fn> has been used to predict soil particle breakage [<xref ref-type="bibr" rid="ref-100">100</xref>], morphology and crushability [<xref ref-type="bibr" rid="ref-99">99</xref>], and in a study of internally-unstable soil involving a mixture of coarse and fine particles [<xref ref-type="bibr" rid="ref-101">101</xref>]. Fabric tensors, with theoretical foundation developed in [<xref ref-type="bibr" rid="ref-102">102</xref>], provide a means to represent directional data such as normals at contact points, even though other types of directional data have been proposed to develop fabric tensors [<xref ref-type="bibr" rid="ref-103">103</xref>]. To model anisotropic behavior of granular materials, a contact-normal fabric tensor was incorporated in an isotropic constitutive law.</p>
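<p>The coordination number and a contact-normal fabric tensor can be computed directly from a list of particle contacts. The sketch below assumes one common definition of the fabric tensor, namely the average of the dyadic product <bold>n</bold> &#x2297; <bold>n</bold> over all contact normals <bold>n</bold>, which may differ in form or normalization from Eq. (<xref ref-type="disp-formula" rid="eqn-404">404</xref>); the contact list, the normals, and the function names are hypothetical.</p>
<code language="python"><![CDATA[
```python
import numpy as np

def contact_counts(contacts, n_particles):
    """Number of contacts on each particle; a contact (a, b) touches particles a and b."""
    counts = np.zeros(n_particles, dtype=int)
    for a, b in contacts:
        counts[a] += 1
        counts[b] += 1
    return counts

def fabric_tensor(normals):
    """Average of the dyadic product n (x) n over all contact normals (assumed definition)."""
    n = np.asarray(normals, dtype=float)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)  # enforce unit length
    return n.T @ n / len(n)

# Hypothetical assembly: 4 particles, 4 contacts, one normal per contact.
contacts = [(0, 1), (0, 2), (0, 3), (1, 2)]
normals = [(1, 0, 0), (0, 1, 0), (0, 0, 2), (1, 1, 0)]

counts = contact_counts(contacts, n_particles=4)
mean_cn = counts.mean()
fabric = fabric_tensor(normals)
print("contacts per particle:", counts)
print("average coordination number:", mean_cn)
print("fabric tensor:")
print(fabric)
```
]]></code>
<p>With unit normals, the trace of this tensor equals 1, and an isotropic distribution of contact normals in three dimensions would give a tensor close to one third of the identity; deviations from that value indicate anisotropy, which is the directional information that scalars such as porosity and coordination number cannot carry.</p>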
<fig id="fig-16">
<label>Figure 16</label>
<caption><title><italic>Coordination number CN</italic> (Sections <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, <xref ref-type="sec" rid="s11_3_2">11.3.2</xref>). (a) Chemistry. Number of bonds to the central atom. Uranium borohydride U(BH<sub>4</sub>)<sub>4</sub> has <italic>CN</italic> = 12, with 12 hydrogen atoms bonded to uranium. (b, c) Photoelastic discs showing number of contact points (coordination number) on a particle. (b) Random packing and force chains, different force directions along principal chains and in secondary particles. (c) Arches around large pores, precarious stability around pores. The coordination number for the large disc (particle) in red square is 5, but only 4 of those had nonzero contact forces based on the bright areas showing stress action. Figures (b, c) also provide a visualization of the &#x201C;flow&#x201D; of the contact normals, and thus the fabric tensor [<xref ref-type="bibr" rid="ref-98">98</xref>]. See also Figure <xref ref-type="fig" rid="fig-17">17</xref>. (Figure reproduced with permission of the author.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-16.tif"/>
</fig>
<p>Figure <xref ref-type="fig" rid="fig-17">17</xref> illustrates the importance of incorporating microstructure data, particularly the fabric tensor, in network training to improve prediction accuracy.</p>
<p><bold>Deep-learning concepts to explain and explore:</bold> (continued from <xref ref-type="fig" rid="fig-10">above</xref> in Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>)</p>
<list list-type="simple">
<list-item><p>(12) Recurrent neural network (RNN), Section <xref ref-type="sec" rid="s7_1">7.1</xref></p></list-item>
<list-item><p>(13) Long Short-Term Memory (LSTM), Section <xref ref-type="sec" rid="s7_2">7.2</xref></p></list-item>
<list-item><p>(14) Attention and Transformer, Section <xref ref-type="sec" rid="s7_4_3">7.4.3</xref></p></list-item>
<list-item><p>(15) Dropout layer and dropout rate,<xref ref-type="fn" rid="fn45"><sup>45</sup></xref><fn id="fn45"><label>45</label><p>Briefly, dropout means to drop or to remove non-output units (neurons) from a base network, thus creating an ensemble of sub-networks (or models) to be trained for each example. Dropout can also be considered as a way to add noise to inputs, particularly of hidden layers, to train the base network, thus making it more robust, since neural networks were known to be not robust to noise. Adding noise is also equivalent to increasing the size of the dataset for training, [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 233, Section 7.4 on &#x201C;Dataset augmentation&#x201D;.</p></fn> which had minor effects in the particular work reported in [<xref ref-type="bibr" rid="ref-25">25</xref>], and thus will not be covered here. See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 251, Section 7.12.</p></list-item></list>
<p>Details of the formulation in [<xref ref-type="bibr" rid="ref-25">25</xref>] are discussed in Section <xref ref-type="sec" rid="s11">11</xref>.</p>
</sec>
<sec id="s2_3_3"><label>2.3.3</label>
<title>Fluid mechanics, reduced-order model for turbulence</title>
<p>The accurate simulation of turbulence in fluid flows ranks among the most demanding tasks in computational mechanics. Owing to the fine spatial and temporal resolution required, transient analysis of turbulence by means of high-fidelity methods such as large eddy simulation (LES) or direct numerical simulation (DNS) involves millions of unknowns even for simple domains.</p>
<p>To simulate complex geometries over longer time periods, one must resort to reduced-order models (ROMs) that can capture the key features of turbulent flows within a low-dimensional approximation space. Proper Orthogonal Decomposition (POD) is a common data-driven approach to construct an orthogonal basis <inline-formula id="ieqn-155"><mml:math id="mml-ieqn-155"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> from high-resolution data obtained from high-fidelity models or measurements, in which <inline-formula id="ieqn-156"><mml:math id="mml-ieqn-156"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> is a point in a 3-D fluid domain <inline-formula id="ieqn-157"><mml:math id="mml-ieqn-157"><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow></mml:math></inline-formula>; see Section <xref ref-type="sec" rid="s12_1">12.1</xref>. 
A flow dynamic quantity <inline-formula id="ieqn-158"><mml:math id="mml-ieqn-158"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, such as a component of the flow velocity field, can be projected on the POD basis by separation of variables as (Figure <xref ref-type="fig" rid="fig-18">18</xref>)</p>

<fig id="fig-17">
<label>Figure 17</label>
<caption><title><italic>Network with LSTM and microstructure data</italic> (porosity <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula>, coordination number <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula>, Figure <xref ref-type="fig" rid="fig-16">16</xref>, fabric tensor <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi>F</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi>F</mml:mi></mml:msub></mml:math></inline-formula>, Eq. (<xref ref-type="disp-formula" rid="eqn-404">404</xref>)) (Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, <xref ref-type="sec" rid="s11_3_2">11.3.2</xref>). Simple shear test using Discrete Element Method to provide network training data under loading-unloading conditions. ANN = Artificial Neural Network with no LSTM units. 
The network with LSTM units and (<inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>C</mml:mi><mml:mi>N</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi>F</mml:mi></mml:msub></mml:math></inline-formula>) improved the predicted traction compared to the networks with LSTM units and only (<inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula>) or (<inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>C</mml:mi><mml:mi>N</mml:mi></mml:math></inline-formula>); the latter two networks produced predicted traction that was worse than that of the network with LSTM units alone, indicating the important role of the fabric tensor, which contained directional data that were absent in scalar fields like (<inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>C</mml:mi><mml:mi>N</mml:mi></mml:math></inline-formula>) [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the author.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-17.tif"/>
</fig>
<p><disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2248;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>k</mml:mi><mml:mo>&lt;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-159"><mml:math id="mml-ieqn-159"><mml:mi>k</mml:mi></mml:math></inline-formula> is a finite number, which could be large, and <inline-formula id="ieqn-160"><mml:math id="mml-ieqn-160"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> a time-dependent coefficient for <inline-formula id="ieqn-161"><mml:math id="mml-ieqn-161"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The computation would be more efficient if a much smaller subset of, say, <inline-formula id="ieqn-162"><mml:math id="mml-ieqn-162"><mml:mi>m</mml:mi><mml:mo>&#x226A;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> POD basis functions were used,</p>
<p><disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2248;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x226A;</mml:mo><mml:mi>k</mml:mi><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-163"><mml:math id="mml-ieqn-163"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> is a subset of indices in the set <inline-formula id="ieqn-164"><mml:math id="mml-ieqn-164"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, chosen such that the approximation in Eq. (<xref ref-type="disp-formula" rid="eqn-4">4</xref>) has minimum error relative to the approximation in Eq. (<xref ref-type="disp-formula" rid="eqn-3">3</xref>)<inline-formula id="ieqn-165"><mml:math id="mml-ieqn-165"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, which uses the much larger set of <inline-formula id="ieqn-166"><mml:math id="mml-ieqn-166"><mml:mi>k</mml:mi></mml:math></inline-formula> POD basis functions.</p>
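<p>The truncation from Eq. (<xref ref-type="disp-formula" rid="eqn-3">3</xref>) to Eq. (<xref ref-type="disp-formula" rid="eqn-4">4</xref>) can be illustrated with a minimal sketch. It assumes that the snapshots are stored column-wise in a matrix and that the POD modes are computed by a thin singular value decomposition, which is a common way to obtain a POD basis but not necessarily the procedure used in the works cited here. Under that assumption the modes are ordered by decreasing singular value, so the index subset reduces to the first <italic>m</italic> modes. The function names <monospace>pod_basis</monospace> and <monospace>reconstruct</monospace>, as well as the synthetic data, are hypothetical.</p>
<preformat>
```python
import numpy as np

def pod_basis(snapshots):
    """Return POD modes, singular values, and time coefficients.

    snapshots: array of shape (n_points, k), one column per snapshot u(x, t_j).
    """
    # Thin SVD: columns of U are the spatial modes phi_i(x),
    # ordered by decreasing singular value.
    U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
    alpha = s[:, None] * Vt  # alpha[i, j] = alpha_i(t_j)
    return U, s, alpha

def reconstruct(U, alpha, m):
    """Reconstruction with the m most dominant modes (Eq. (4) with i_j = j)."""
    return U[:, :m] @ alpha[:m, :]

# Hypothetical synthetic data (an assumption for illustration only):
# 200 spatial points, k = 50 snapshots, three smooth structures plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)[:, None]
t = np.linspace(0.0, 1.0, 50)[None, :]
snapshots = (np.sin(2 * np.pi * x) * np.cos(2 * np.pi * t)
             + 0.5 * np.sin(4 * np.pi * x) * np.sin(6 * np.pi * t)
             + 0.1 * np.cos(6 * np.pi * x) * t
             + 1e-3 * rng.standard_normal((200, 50)))

U, s, alpha = pod_basis(snapshots)
for m in (1, 3, 5):
    err = np.linalg.norm(snapshots - reconstruct(U, alpha, m))
    rel = err / np.linalg.norm(snapshots)
    print(f"m = {m}: relative reconstruction error = {rel:.2e}")
```
</preformat>
<p>In this sketch, the reconstruction error in the Frobenius norm equals the square root of the sum of the squared discarded singular values, which is why retaining the leading modes gives the smallest error for a given <italic>m</italic>.</p>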
<p>In a Galerkin-Projection (GP) approach to reduced-order modeling, a small subset of dominant modes forms a basis onto which high-dimensional differential equations are projected to obtain a set of lower-dimensional differential equations for cost-efficient computational analysis.</p>
<p>Instead of using GP, RNNs (Recurrent Neural Networks) were used in [<xref ref-type="bibr" rid="ref-26">26</xref>] to predict the evolution of fluid flows, specifically the coefficients of the dominant POD modes, rather than solving differential equations. For this purpose, their LSTM-ROM (Long Short-Term Memory - Reduced Order Model) approach combined concepts of ROM based on POD with deep-learning neural networks using either the original LSTM units, Figure <xref ref-type="fig" rid="fig-117">117</xref> (left) [<xref ref-type="bibr" rid="ref-24">24</xref>], or the bidirectional LSTM (BiLSTM), Figure <xref ref-type="fig" rid="fig-117">117</xref> (right) [<xref ref-type="bibr" rid="ref-104">104</xref>], the internal states of which were well-suited for the modeling of dynamical systems.</p>
<fig id="fig-18">
<label>Figure 18</label>
<caption><title><italic>Reduced-order POD basis</italic> (Sections <xref ref-type="sec" rid="s2_3_3">2.3.3</xref>, <xref ref-type="sec" rid="s12_1">12.1</xref>). For each dataset (also Figure <xref ref-type="fig" rid="fig-116">116</xref>), which contained <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>k</mml:mi></mml:math></inline-formula> snapshots, the full POD reconstruction of the flow-field dynamical quantity <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> is a point in the 3-D flow field, consists of all <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mi>k</mml:mi></mml:math></inline-formula> basis functions <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>, using Eq. (<xref ref-type="disp-formula" rid="eqn-3">3</xref>); see also Eq. (<xref ref-type="disp-formula" rid="eqn-439">439</xref>). 
Typically, <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mi>k</mml:mi></mml:math></inline-formula> is large; a reduced-order POD basis consists of selecting <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mi>m</mml:mi><mml:mo>&#x226A;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> basis functions for the reconstruction, with the smallest error possible. See Figure <xref ref-type="fig" rid="fig-19">19</xref> for the use of deep-learning networks to predict <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mrow><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula>, with <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mrow><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:math></inline-formula>, given <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-26">26</xref>]. (Figure reproduced with permission of the author.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-18.tif"/>
</fig>
<p>To obtain training/testing data, which were crucial to train/test neural networks, the data from transient 3-D Direct Numerical Simulations (DNS) of the Navier-Stokes equations for two physical problems, as provided by the Johns Hopkins turbulence database [<xref ref-type="bibr" rid="ref-105">105</xref>], were used in [<xref ref-type="bibr" rid="ref-26">26</xref>]: (1) the <italic>Forced Isotropic Turbulence</italic> (ISO) and (2) the <italic>Magnetohydrodynamic Turbulence</italic> (MHD).</p>
<p>To generate training data for LSTM/BiLSTM networks, the 3-D turbulent fluid flow domain of each physical problem was decomposed into five equidistant 2-D planes (slices), with one additional equidistant 2-D plane serving to generate testing data (Section <xref ref-type="sec" rid="s12">12</xref>, Figure <xref ref-type="fig" rid="fig-116">116</xref>, Remark <xref ref-type="statement" rid="st12_1">12.1</xref>). For the same subregion in each of those 2-D planes, POD was applied to the <inline-formula id="ieqn-167"><mml:math id="mml-ieqn-167"><mml:mi>k</mml:mi></mml:math></inline-formula> snapshots of the velocity field (<inline-formula id="ieqn-168"><mml:math id="mml-ieqn-168"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>5</mml:mn><mml:mo>,</mml:mo><mml:mn>023</mml:mn></mml:math></inline-formula> for ISO, <inline-formula id="ieqn-169"><mml:math id="mml-ieqn-169"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>024</mml:mn></mml:math></inline-formula> for MHD, Section <xref ref-type="sec" rid="s12_1">12.1</xref>), and out of <inline-formula id="ieqn-170"><mml:math id="mml-ieqn-170"><mml:mi>k</mml:mi></mml:math></inline-formula> POD modes <inline-formula id="ieqn-171"><mml:math id="mml-ieqn-171"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, the five (<inline-formula id="ieqn-172"><mml:math id="mml-ieqn-172"><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>5</mml:mn><mml:mo>&#x226A;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>) most dominant POD modes <inline-formula id="ieqn-173"><mml:math id="mml-ieqn-173"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> representative of the flow dynamics (Figure <xref ref-type="fig" rid="fig-18">18</xref>) were retained to form a reduced-order basis onto which the velocity field was projected. The coefficient <inline-formula id="ieqn-174"><mml:math id="mml-ieqn-174"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of the POD mode <inline-formula id="ieqn-175"><mml:math id="mml-ieqn-175"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represented the evolution of the participation of that mode in the velocity field, and was decomposed into thousands of small samples using a moving window. The first half of each sample was used as the input signal to an LSTM network, whereas the second half of the sample was used as the output signal for supervised training of the network. Two different methods were proposed in [<xref ref-type="bibr" rid="ref-26">26</xref>]:</p>
<list list-type="simple">
<list-item><p>(1) <italic>Multiple-network method:</italic> Use an RNN for each coefficient of the dominant POD modes.</p></list-item>
<list-item><p>(2) <italic>Single-network method:</italic> Use a single RNN for all coefficients of the dominant POD modes.</p></list-item></list>
<p>For both methods, variants with the original LSTM units or the BiLSTM units were implemented. Each of the employed RNNs had a single hidden layer.</p>
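<p>The moving-window sampling of a coefficient history into input/output halves, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions: the window length, the stride, the function name <monospace>moving_window_samples</monospace>, and the short coefficient history are all hypothetical, since the actual values used in [<xref ref-type="bibr" rid="ref-26">26</xref>] are not given here.</p>
<preformat>
```python
def moving_window_samples(series, window, stride=1):
    """Split a 1-D sequence into (input, target) pairs.

    Each sample spans `window` consecutive values (window must be even);
    its first half is the network input and its second half is the target.
    """
    if window % 2 != 0:
        raise ValueError("window must be even")
    half = window // 2
    samples = []
    for start in range(0, len(series) - window + 1, stride):
        chunk = series[start:start + window]
        samples.append((chunk[:half], chunk[half:]))
    return samples

# Hypothetical coefficient history alpha_i(t_n), n = 0, ..., 9 (assumed data).
alpha_i = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
pairs = moving_window_samples(alpha_i, window=4)
print(len(pairs))   # number of samples produced by the moving window
print(pairs[0])     # first (input, target) pair
```
</preformat>
<p>In the multiple-network method, each coefficient history would yield its own set of such pairs; in the single-network method, the pairs of all dominant coefficients would be stacked into one training set.</p>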
<p>Demonstrative results for the prediction capabilities of both the original LSTM and the BiLSTM networks are illustrated in Figure <xref ref-type="fig" rid="fig-20">20</xref>. Contrary to the authors&#x2019; expectation, networks with the original LSTM units performed better than those using BiLSTM units in both physical problems of isotropic turbulence (ISO) (Figure <xref ref-type="fig" rid="fig-20">20a</xref>) and magnetohydrodynamics (MHD) (Figure <xref ref-type="fig" rid="fig-20">20b</xref>) [<xref ref-type="bibr" rid="ref-26">26</xref>].</p>
<p>Details of the formulation in [<xref ref-type="bibr" rid="ref-26">26</xref>] are discussed in Section <xref ref-type="sec" rid="s12">12</xref>.</p>
<fig id="fig-19">
<label>Figure 19</label>
<caption><title><italic>Deep-learning LSTM/BiLSTM Reduced Order Model</italic> (Sections <xref ref-type="sec" rid="s2_3_3">2.3.3</xref>, <xref ref-type="sec" rid="s12_2">12.2</xref>). See Figure <xref ref-type="fig" rid="fig-18">18</xref> for the POD reduced-order basis. The time-dependent coefficients <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of the dominant POD modes <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, with <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, from each training dataset were used to train LSTM (or BiLSTM, Figure <xref ref-type="fig" rid="fig-117">117</xref>) neural networks to predict <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mrow><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula>, with <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mrow><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:math></inline-formula>, given <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of the test datasets [<xref ref-type="bibr" rid="ref-26">26</xref>]. 
(Figure reproduced with permission of the author.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-19.tif"/>
</fig>
</sec></sec>
</sec>
<sec id="s3"><label>3</label>
<title>Computational mechanics, neuroscience, deep learning</title>
<p>Table <xref ref-type="table" rid="table-1">1</xref> below presents a rough comparison that shows the parallelism in the modeling steps in three fields: computational mechanics, neuroscience, and deep learning, the last of which was heavily indebted to neuroscience until it reached a more mature state and then took on its own development.</p>
<p>We assume that readers are familiar with the concepts listed in the second column on &#x201C;Computational mechanics&#x201D;, and briefly explain some key concepts in the third column on &#x201C;Neuroscience&#x201D; to connect to the fourth column &#x201C;Deep learning&#x201D;, which is explained in detail in subsequent sections.</p>
<p>See Section <xref ref-type="sec" rid="s13_2">13.2</xref> for more details on the theoretical foundation based on Volterra series for the spatial and temporal combinations of inputs, weights, and biases, widely used in artificial neural networks or multilayer perceptrons.</p>
<p>A neuron spiking response such as that shown in Figure <xref ref-type="fig" rid="fig-21">21</xref> can be modelled accurately using a model such as &#x201C;Integrate-and-Fire&#x201D;. The firing-rate response <inline-formula id="ieqn-176"><mml:math id="mml-ieqn-176"><mml:mi>r</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of a biological neuron to a stimulus <inline-formula id="ieqn-177"><mml:math id="mml-ieqn-177"><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is described by a convolution integral<xref ref-type="fn" rid="fn46"><sup>46</sup></xref><fn id="fn46"><label>46</label><p>From here on, if Eq. (<xref ref-type="disp-formula" rid="eqn-5">5</xref>) is found a bit abstract at first reading, first-time learners could skip the remainder of this short Section <xref ref-type="sec" rid="s3">3</xref> to begin reading Section <xref ref-type="sec" rid="s4">4</xref>, and come back later after reading through subsequent sections, particularly Section <xref ref-type="sec" rid="s13_2">13.2</xref>, to have an overview of the connection among seemingly separate topics.</p></fn></p>
<p><disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>r</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mi>&#x03C4;</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-178"><mml:math id="mml-ieqn-178"><mml:msub><mml:mi>r</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> is the background firing rate at zero stimulus, <inline-formula id="ieqn-179"><mml:math id="mml-ieqn-179"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the synaptic kernel, and <inline-formula id="ieqn-180"><mml:math id="mml-ieqn-180"><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the stimulus; see, e.g., [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 46, Eq. (2.1).<xref ref-type="fn" rid="fn47"><sup>47</sup></xref><fn id="fn47"><label>47</label><p>Eq. (<xref ref-type="disp-formula" rid="eqn-5">5</xref>) is a reformulated version of Eq. (2.1) in [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 46, and is similar to Eqs. (7.1)-(7.2) in [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 233, Chapter &#x201C;7 Network Models&#x201D;.</p></fn> The stimulus <inline-formula id="ieqn-181"><mml:math id="mml-ieqn-181"><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-5">5</xref>) is a train (sequence in time) of spikes described by</p>
<p><disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:mrow><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-182"><mml:math id="mml-ieqn-182"><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the Dirac delta. Eq. (<xref ref-type="disp-formula" rid="eqn-5">5</xref>) then describes the firing rate <inline-formula id="ieqn-183"><mml:math id="mml-ieqn-183"><mml:mi>r</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> at time <inline-formula id="ieqn-184"><mml:math id="mml-ieqn-184"><mml:mi>t</mml:mi></mml:math></inline-formula> as the collective memory effect of all spikes, going from the current time <inline-formula id="ieqn-185"><mml:math id="mml-ieqn-185"><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi></mml:math></inline-formula> back far in the past with <inline-formula id="ieqn-186"><mml:math id="mml-ieqn-186"><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula>, with the weight for the spike <inline-formula id="ieqn-187"><mml:math id="mml-ieqn-187"><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> at time <inline-formula id="ieqn-188"><mml:math id="mml-ieqn-188"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> provided by the value of the synaptic kernel <inline-formula id="ieqn-189"><mml:math id="mml-ieqn-189"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> at the same time <inline-formula id="ieqn-190"><mml:math id="mml-ieqn-190"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>.</p>
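<p>When the spike train of Eq. (<xref ref-type="disp-formula" rid="eqn-6">6</xref>) is substituted into Eq. (<xref ref-type="disp-formula" rid="eqn-5">5</xref>), the Dirac deltas collapse the convolution integral into a sum of kernel values, one for each past spike. The following minimal sketch evaluates that sum. The exponentially decaying kernel, its parameters, the spike times, and the function names are assumptions made only for illustration; they are not taken from [<xref ref-type="bibr" rid="ref-19">19</xref>]. Counting a spike that occurs exactly at the current time in full is likewise a convention assumed here.</p>
<preformat>
```python
import math

def firing_rate(t, spike_times, r0, kernel):
    """Evaluate Eq. (5) for the spike-train stimulus of Eq. (6).

    The integral reduces to a sum of kernel values, one per spike
    that occurred at or before time t (assumed convention).
    """
    return r0 + sum(kernel(t - ti) for ti in spike_times if t >= ti)

def exp_kernel(s, amplitude=2.0, tau=0.5):
    """Hypothetical exponentially decaying kernel, for illustration only."""
    return amplitude * math.exp(-s / tau)

spikes = [0.0, 0.3, 1.0]   # hypothetical spike times t_i
for t in (0.0, 0.5, 2.0):
    print(t, firing_rate(t, spikes, r0=1.0, kernel=exp_kernel))
```
</preformat>
<p>With a decaying kernel of this kind, recent spikes contribute more to the firing rate than older ones, which is the memory effect described above.</p>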

<fig id="fig-20">
<label>Figure 20</label>
<caption><title><italic>Prediction results and errors for LSTM/BiLSTM networks</italic> (Sections <xref ref-type="sec" rid="s2_3_3">2.3.3</xref>, <xref ref-type="sec" rid="s12_3">12.3</xref>). Coefficients <italic>&#x03B1;<sub>i</sub></italic>(<italic>t</italic>), for <italic>i</italic> = 1, &#x2026;, 5, of dominant POD modes. LSTM had smaller errors compared to BiLSTM for both physical problems (ISO and MHD).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-20.tif"/>
</fig>
<fig id="fig-21">
<label>Figure 21</label>
<caption><title><italic>Biological neuron response to stimulus, experimental result</italic> (Section <xref ref-type="sec" rid="s3">3</xref>). An oscillating current was injected into the neuron (top), and the neuron spiking response was recorded (bottom) [<xref ref-type="bibr" rid="ref-106">106</xref>], with permission from <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fnins.2011.00009">the authors and Frontiers Media SA</ext-link>. A spike, also called an action potential, is an electrical potential pulse across the cell membrane with an amplitude of about 100 millivolts and a duration of about 1 millisecond. &#x201C;Neurons represent and transmit information by firing sequences of spikes in various temporal patterns&#x201D; [<xref ref-type="bibr" rid="ref-19">19</xref>], pp. 3-4.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-21.tif"/>
</fig>
<p>It will be seen in Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref> on &#x201C;Dynamics, time dependence, Volterra series&#x201D; that the convolution integral in Eq. (<xref ref-type="disp-formula" rid="eqn-5">5</xref>) corresponds to the linear part of the Volterra series of the nonlinear response of a biological neuron in terms of the stimulus, Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>), which in turn provides the theoretical foundation for taking the linear combination of inputs, weights, and biases for an artificial neuron in a multilayer neural network, as represented by Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>).</p>
<p>The Integrate-and-Fire model for a biological neuron provides a motivation for the use of the rectified linear units (ReLU) as activation function in multilayer neural networks (or perceptrons); see Figure <xref ref-type="fig" rid="fig-28">28</xref>.</p>
<p>Eq. (<xref ref-type="disp-formula" rid="eqn-5">5</xref>) is also related to the exponential smoothing technique used in forecasting and applied to stochastic optimization methods to train multilayer neural networks; see Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref> on &#x201C;Forecasting time series, exponential smoothing&#x201D;.</p>
</sec>
<sec id="s4"><label>4</label>
<title>Statics, feedforward networks</title>
<p>We examine in detail the forward propagation in feedforward networks, in which the function mappings flow<xref ref-type="fn" rid="fn48"><sup>48</sup></xref><fn id="fn48"><label>48</label><p>There is no physical flow here, only function mappings.</p></fn> in only one direction, forward from input to output.</p>
<sec id="s4_1"><label>4.1</label>
<title>Two concept presentations</title>
<p>There are two ways to present the concept of deep-learning neural networks: The top-down approach versus the bottom-up approach.</p>
<sec id="s4_1_1"><label>4.1.1</label>
<title>Top-down approach</title>
<p>The <italic>top-down</italic> approach starts by giving up-front the mathematical big picture of what a neural network is, with the big-picture (high level) graphical representation, then gradually goes down to the detailed specifics of a processing unit (often referred to as an <italic>artificial neuron</italic>) and its low-level graphical representation. A definite advantage of this top-down approach is that readers new to the field immediately have the big picture in mind, before going down to the nitty-gritty details, and thus tend not to get lost. An excellent reference for the top-down approach is [<xref ref-type="bibr" rid="ref-78">78</xref>], and there are not many such references.</p> 
<p>Specifically, for a multilayer feedforward network, by top-down, we mean starting from a general description in Eq. (<xref ref-type="disp-formula" rid="eqn-18">18</xref>) and going down to the detailed construct of a neuron through a weighted sum with bias in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>) and then a nonlinear activation function in Eq. (<xref ref-type="disp-formula" rid="eqn-35">35</xref>).</p>
<p>In terms of block diagrams, we begin our <italic>top-down</italic> descent from the big picture of the overall multilayer neural network with <inline-formula id="ieqn-196"><mml:math id="mml-ieqn-196"><mml:mi>L</mml:mi></mml:math></inline-formula> layers in Figure <xref ref-type="fig" rid="fig-23">23</xref>, through Figure <xref ref-type="fig" rid="fig-34">34</xref> for a typical layer <inline-formula id="ieqn-197"><mml:math id="mml-ieqn-197"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and Figure <xref ref-type="fig" rid="fig-35">35</xref> for the lower-level details of layer <inline-formula id="ieqn-198"><mml:math id="mml-ieqn-198"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, then down to the most basic level, a neuron in Figure <xref ref-type="fig" rid="fig-36">36</xref> as one row in layer <inline-formula id="ieqn-199"><mml:math id="mml-ieqn-199"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-35">35</xref>, the equivalent figure of Figure <xref ref-type="fig" rid="fig-8">8</xref>, which in turn was the starting point in [<xref ref-type="bibr" rid="ref-38">38</xref>].</p>
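<p>The top-down view described above, in which a network is a composition of layers and each layer is a weighted sum with bias followed by an activation function, can be sketched in a few lines. This is only an illustration under stated assumptions: the 2-3-1 architecture, the weights and biases, the choice of ReLU for the hidden layer, and the helper names <monospace>make_layer</monospace> and <monospace>network</monospace> are all hypothetical and do not reproduce the notation of the equations referenced above.</p>
<preformat>
```python
from functools import reduce

def relu(z):
    """Rectified linear unit applied to each entry of a list."""
    return [max(0.0, zi) for zi in z]

def identity(z):
    """Linear (identity) activation, used here for the output layer."""
    return list(z)

def make_layer(W, b, activation):
    """One layer: weighted sum with bias, followed by an activation."""
    def layer(y):
        z = [sum(w * yj for w, yj in zip(row, y)) + bi
             for row, bi in zip(W, b)]
        return activation(z)
    return layer

def network(layers):
    """Compose layers so that the output of one is the input of the next."""
    return lambda x: reduce(lambda y, f: f(y), layers, x)

# Hypothetical weights and biases for a 2-3-1 network (assumed values).
layer1 = make_layer([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]],
                    [0.0, -1.0, 0.5], relu)
layer2 = make_layer([[1.0, 2.0, -1.0]], [0.1], identity)
f = network([layer1, layer2])
print(f([1.0, 2.0]))
```
</preformat>
<p>Reading the sketch from <monospace>network</monospace> down to <monospace>make_layer</monospace> follows the top-down descent; reading it in the opposite order corresponds to the bottom-up presentation.</p>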
</sec>
<sec id="s4_1_2"><label>4.1.2</label>
<title>Bottom-up approach</title>
<p>The <italic>bottom-up</italic> approach typically starts with a biological neuron (see Figure <xref ref-type="fig" rid="fig-131">131</xref> in Section <xref ref-type="sec" rid="s13_1">13.1</xref> below), then introduces an <italic>artificial neuron</italic> that looks similar to the biological neuron (compare Figure <xref ref-type="fig" rid="fig-8">8</xref> to Figure <xref ref-type="fig" rid="fig-131">131</xref>), with multiple inputs and a single output, which becomes an input to each of a multitude of other artificial neurons; see, e.g., [<xref ref-type="bibr" rid="ref-23">23</xref>] [<xref ref-type="bibr" rid="ref-21">21</xref>] [<xref ref-type="bibr" rid="ref-38">38</xref>] [<xref ref-type="bibr" rid="ref-20">20</xref>].<xref ref-type="fn" rid="fn49"><sup>49</sup></xref><fn id="fn49"><label>49</label><p>Figure 2 in [<xref ref-type="bibr" rid="ref-20">20</xref>] is essentially the same as Figure <xref ref-type="fig" rid="fig-8">8</xref>.</p></fn> Although Figure <xref ref-type="fig" rid="fig-7">7</xref>, which preceded Figure <xref ref-type="fig" rid="fig-8">8</xref> in [<xref ref-type="bibr" rid="ref-38">38</xref>], showed a network, its information content is not the same as that of Figure <xref ref-type="fig" rid="fig-23">23</xref>.</p>
<p>Readers unfamiliar with the field, when looking at the graphical representation of an artificial neural network (see, e.g., Figure <xref ref-type="fig" rid="fig-7">7</xref>), could be misled into thinking in terms of electrical (or fluid-flow) networks, in which Kirchhoff&#x2019;s law applies at the junction where the output is split into different directions to go toward other artificial neurons. The big picture is not clear at the outset, and could be confusing to readers new to the field, who would take some time to understand it; see also Footnote <xref ref-type="fn" rid="fn5">5</xref>. By contrast, Figure <xref ref-type="fig" rid="fig-23">23</xref> clearly shows a multilevel function composition, assuming that first-time learners are familiar with this basic mathematical concept.</p>
</sec>
</sec>
<sec id="s4_2"><label>4.2</label>
<title>Matrix notation</title>
<p>In mechanics and physics, tensors are intrinsic geometrical objects, which can be represented by infinitely many matrices of components, depending on the coordinate systems.<xref ref-type="fn" rid="fn50"><sup>50</sup></xref><fn id="fn50"><label>50</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-107">107</xref>], [<xref ref-type="bibr" rid="ref-108">108</xref>].</p></fn> Vectors are tensors of order 1. For this reason, we use <italic>neither</italic> the name &#x201C;vector&#x201D; for a column matrix, nor the name &#x201C;tensor&#x201D; for an array with more than two indices.<xref ref-type="fn" rid="fn51"><sup>51</sup></xref><fn id="fn51"><label>51</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 31, where a &#x201C;vector&#x201D; is a column matrix, and a &#x201C;tensor&#x201D; is an array with coefficients (elements) having more than two indices, e.g., <inline-formula id="ieqn-3010"><mml:math id="mml-ieqn-3010"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. It is important to know the terminologies used in computer-science literature.</p></fn> All arrays are matrices.</p>
<table-wrap id="table-1"><label>Table 1</label>
<caption><p><italic>Top-down (rough) comparison of modeling steps in three fields</italic> (Section <xref ref-type="sec" rid="s3">3</xref>): Computational mechanics, neuroscience, and deep learning</p></caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th align="center">Study object</th>
<th align="center">Engineering continuum</th>
<th align="center">The Brain</th>
<th align="center">Image recognition</th>
</tr>
<tr>
<th align="center">Field</th>
<th align="center">Computational mechanics</th>
<th align="center">Computational neuroscience</th>
<th align="center">Deep learning</th>
</tr>
<tr>
<th align="center">Modeling<break/>Inputs</th>
<th align="center">Partial Differential Equations<break/>Forces (solids), velocities (fluids)</th>
<th align="center">Biological neural networks<break/>Firing rate as stimulus</th>
<th align="center">Artificial neural networks<break/>An image to classify</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">1</td>
<td align="center">Weak form, finite-element mesh, order of interpolation</td>
<td align="center">Network architectures, two layers (input, output), several neurons per layer</td>
<td align="center">Network architectures, many layers (input, hidden, output), very high number of neurons and parameters</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">Elements</td>
<td align="center">Neurons, dendrites, synapses, axons</td>
<td align="center">Processing units (neurons, perceptrons)</td>  
</tr>
<tr>
<td align="center">3</td>
<td align="center">Nonlinear force-displacement and stress-strain (&#x03C3;-&#x03f5;) relations</td>
<td align="center">Firing model, spiking model, firing rate vs input current (FI) relation, continuous stimulus and response, Volterra series, kernels of increasing orders</td>
<td align="center">&#x2014;</td>  
</tr>
<tr>
<td align="center">4</td>
<td align="center">Linearized force-displacement and stress-strain relations (Hooke&#x2019;s law)</td>
<td align="center">Linear term in Volterra series, synaptic kernel &#x1D4A6;<sub>1</sub>(&#x03C4;) of order 1, continuous temporal weight</td>
<td align="center">Many hidden layers (discrete weights and biases)</td>  
</tr>
<tr>
<td align="center">5</td>
<td align="center">&#x2014;</td>
<td align="center">Linear combination of inputs, with input weights <bold><italic>w</italic></bold></td>
<td align="center">Linear combination of inputs plus biases, input weights <bold><italic>w</italic></bold></td>  
</tr>
<tr>
<td align="center">6</td>
<td align="center">&#x2014;</td>
<td align="center">Static nonlinearity</td>
<td align="center">Activation function</td>  
</tr>
<tr>
<td align="center">Outputs</td>
<td align="center">Displacements (solids), velocities (fluids)</td>
<td align="center">Firing rate as response</td>
<td align="center">Image classified (car, frog, human)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The matrix notation used here can follow either (1) the Matlab / Octave code syntax, or (2) the more compact component convention for tensors in mechanics.</p>
<p>Using Matlab / Octave code syntax, the inputs to a network (to be defined soon) are gathered in an <inline-formula id="ieqn-200"><mml:math id="mml-ieqn-200"><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> column matrix <inline-formula id="ieqn-201"><mml:math id="mml-ieqn-201"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> of real numbers, and the (<italic>target</italic> or labeled) outputs<xref ref-type="fn" rid="fn52"><sup>52</sup></xref><fn id="fn52"><label>52</label><p>The inputs <inline-formula id="ieqn-3011"><mml:math id="mml-ieqn-3011"><mml:mi mathvariant='bold-italic'>x</mml:mi></mml:math></inline-formula> and the <italic>target</italic> (or labeled) outputs <inline-formula id="ieqn-3012"><mml:math id="mml-ieqn-3012"><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:math></inline-formula> are the data used to train the network, which produces the predicted (or approximated) output denoted by <inline-formula id="ieqn-3013"><mml:math id="mml-ieqn-3013"><mml:mover accent='true'><mml:mi mathvariant='bold-italic'>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:math></inline-formula> with an overhead tilde reminiscent of the approximation symbol &#x2248;. See also Footnote <xref ref-type="fn" rid="fn87">87</xref>.</p></fn> in an <inline-formula id="ieqn-202"><mml:math id="mml-ieqn-202"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> column matrix <inline-formula id="ieqn-203"><mml:math id="mml-ieqn-203"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula> 
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>;</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>;</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x22EE;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:msup><mml:mo 
stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where the commas are separators for matrix elements in a row, and the semicolons are separators for rows. For the matrix transpose, we keep the standard notation using the superscript &#x201C;<inline-formula id="ieqn-204"><mml:math id="mml-ieqn-204"><mml:mi>T</mml:mi></mml:math></inline-formula>&#x201D; for written documents, instead of the prime &#x201C;<inline-formula id="ieqn-205"><mml:math id="mml-ieqn-205"><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:math></inline-formula>&#x201D; used in Matlab / Octave code. In addition, the prime &#x201C;<inline-formula id="ieqn-206"><mml:math id="mml-ieqn-206"><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:math></inline-formula>&#x201D; is more customarily used to denote a derivative in handwritten and typeset equations.</p>
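<p>The row and column layouts of Eq. (<xref ref-type="disp-formula" rid="eqn-7">7</xref>) can be mimicked in a few lines of code. The following is only a minimal sketch in plain Python, assuming that a matrix is stored as a list of rows; the helper names <monospace>column</monospace>, <monospace>transpose</monospace>, and <monospace>shape</monospace> are hypothetical and do not come from the present paper or from Matlab / Octave.</p>
<code language="python">
```python
# Illustrative helpers only; the names are hypothetical, not from the paper.
def column(values):
    """Build an n x 1 column matrix, like the Matlab / Octave literal [x1; ...; xn]."""
    return [[v] for v in values]


def transpose(matrix):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*matrix)]


def shape(matrix):
    """Return (number of rows, number of columns)."""
    return (len(matrix), len(matrix[0]))


row = [[1.0, 2.0, 3.0]]      # [x1, x2, x3]: commas separate elements in a row
x = transpose(row)           # [x1, x2, x3]^T: a 3 x 1 column matrix
assert x == column([1.0, 2.0, 3.0])
print(shape(row), shape(x))  # prints (1, 3) (3, 1)
```
</code>
<p>Storing even a single column as a list of one-element rows keeps the matrix dimensions explicit, in line with the choice made here to treat all arrays as matrices.</p>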
<p>Using the component convention for tensors in mechanics,<xref ref-type="fn" rid="fn53"><sup>53</sup></xref><fn id="fn53"><label>53</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-107">107</xref>], [<xref ref-type="bibr" rid="ref-109">109</xref>], [<xref ref-type="bibr" rid="ref-110">110</xref>].</p></fn> the coefficients of an <inline-formula id="ieqn-207"><mml:math id="mml-ieqn-207"><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula> matrix shown below 
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>A</mml:mi><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
are arranged according to the following convention for the free indices <inline-formula id="ieqn-208"><mml:math id="mml-ieqn-208"><mml:mi>i</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-209"><mml:math id="mml-ieqn-209"><mml:mi>j</mml:mi></mml:math></inline-formula>, which are automatically expanded to their respective full range, i.e., <inline-formula id="ieqn-210"><mml:math id="mml-ieqn-210"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-211"><mml:math id="mml-ieqn-211"><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, when the variables <inline-formula id="ieqn-212"><mml:math id="mml-ieqn-212"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi>A</mml:mi><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> are enclosed in square brackets:</p>
<list list-type="simple">
<list-item><label>(1)</label><p>In case both indices are subscripts, then the left subscript (index <inline-formula id="ieqn-213"><mml:math id="mml-ieqn-213"><mml:mi>i</mml:mi></mml:math></inline-formula> of <inline-formula id="ieqn-214"><mml:math id="mml-ieqn-214"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-8">8</xref>)) denotes the row index, whereas the right subscript (index <inline-formula id="ieqn-215"><mml:math id="mml-ieqn-215"><mml:mi>j</mml:mi></mml:math></inline-formula> of <inline-formula id="ieqn-216"><mml:math id="mml-ieqn-216"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-8">8</xref>)) denotes the column index.</p></list-item>
<list-item><label>(2)</label><p>In case one index is a superscript, and the other index is a subscript, then the superscript (upper index <inline-formula id="ieqn-217"><mml:math id="mml-ieqn-217"><mml:mi>i</mml:mi></mml:math></inline-formula> of <inline-formula id="ieqn-218"><mml:math id="mml-ieqn-218"><mml:msubsup><mml:mi>A</mml:mi><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-8">8</xref>)) is the row index, and the subscript (lower index <inline-formula id="ieqn-219"><mml:math id="mml-ieqn-219"><mml:mi>j</mml:mi></mml:math></inline-formula> of <inline-formula id="ieqn-220"><mml:math id="mml-ieqn-220"><mml:msubsup><mml:mi>A</mml:mi><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-8">8</xref>)) is the column index.<xref ref-type="fn" rid="fn54"><sup>54</sup></xref><fn id="fn54"><label>54</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-111">111</xref>], Footnote <xref ref-type="fn" rid="fn11">11</xref>. For example, <inline-formula id="ieqn-3015"><mml:math id="mml-ieqn-3015"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mn>32</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi>A</mml:mi><mml:mn>2</mml:mn><mml:mn>3</mml:mn></mml:msubsup></mml:math></inline-formula> is the coefficient in row 3 and column 2.</p></fn></p></list-item></list>
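<p>The two rules above amount to a single lookup rule: the left subscript (or the superscript) selects the row, and the right subscript (or the lower index) selects the column. As a minimal sketch in plain Python, assuming a matrix stored as a list of rows and a hypothetical accessor <monospace>coeff</monospace> with 1-based indices as in the text:</p>
<code language="python">
```python
# Hypothetical 3 x 2 matrix stored as a list of rows; each entry encodes
# its own position as (10 * row + column) so the lookup is easy to check.
A = [[11, 12],
     [21, 22],
     [31, 32]]


def coeff(matrix, i, j):
    """Return the coefficient in row i and column j (1-based, as in the text)."""
    return matrix[i - 1][j - 1]


# A_32, equivalently A with upper index 3 and lower index 2: row 3, column 2.
print(coeff(A, 3, 2))  # prints 32
```
</code>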
<p>With this convention (lower index designates column index, while upper index designates row index), the coefficients of array <inline-formula id="ieqn-221"><mml:math id="mml-ieqn-221"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-7">7</xref>) can be presented either in row form (with lower index) or in column form (with upper index) as follows: 
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msup><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msup><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x22EE;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msup><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" 
/><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Instead of automatically associating any matrix variable such as <inline-formula id="ieqn-222"><mml:math id="mml-ieqn-222"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> with the column matrix of its components, we indicate the matrix dimensions explicitly, as in Eq. (<xref ref-type="disp-formula" rid="eqn-7">7</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-9">9</xref>), i.e., by specifying the values <inline-formula id="ieqn-223"><mml:math id="mml-ieqn-223"><mml:mi>m</mml:mi></mml:math></inline-formula> (number of rows) and <inline-formula id="ieqn-224"><mml:math id="mml-ieqn-224"><mml:mi>n</mml:mi></mml:math></inline-formula> (number of columns) of its containing space <inline-formula id="ieqn-225"><mml:math id="mml-ieqn-225"><mml:mrow><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>.</p>
<p>Consider the Jacobian matrix 
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#xA0;and let&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi>A</mml:mi><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>:=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where <inline-formula id="ieqn-226"><mml:math id="mml-ieqn-226"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-227"><mml:math id="mml-ieqn-227"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> are column matrices shown in Eq. (<xref ref-type="disp-formula" rid="eqn-7">7</xref>). Then the coefficients of this Jacobian matrix are arranged with the upper index <inline-formula id="ieqn-228"><mml:math id="mml-ieqn-228"><mml:mi>i</mml:mi></mml:math></inline-formula> being the row index, and the lower index <inline-formula id="ieqn-229"><mml:math id="mml-ieqn-229"><mml:mi>j</mml:mi></mml:math></inline-formula> being the column index.<xref ref-type="fn" rid="fn55"><sup>55</sup></xref><fn id="fn55"><label>55</label><p>For example, the coefficient <inline-formula id="ieqn-3018"><mml:math id="mml-ieqn-3018"><mml:msubsup><mml:mi>A</mml:mi><mml:mn>2</mml:mn><mml:mn>3</mml:mn></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac></mml:math></inline-formula> is in row 3 and column 2. The Jacobian matrix in this convention is the transpose of that used in [<xref ref-type="bibr" rid="ref-39">39</xref>], p. 175.</p></fn> This convention is natural when converting a chain rule in component form into matrix form. For example, consider the composition of matrix functions 
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>p</mml:mi></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>p</mml:mi></mml:msup><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>n</mml:mi></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>n</mml:mi></mml:msup><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>m</mml:mi></mml:msup><mml:mspace 
width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>m</mml:mi></mml:msup><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>l</mml:mi></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where implicitly 
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>p</mml:mi></mml:msup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>m</mml:mi></mml:msup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>l</mml:mi></mml:msup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi 
mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
are the spaces of column matrices. Then, using the chain rule 
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where the summation convention on the repeated indices <inline-formula id="ieqn-230"><mml:math id="mml-ieqn-230"><mml:mi>r</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-231"><mml:math id="mml-ieqn-231"><mml:mi>s</mml:mi></mml:math></inline-formula> is implied. Then the Jacobian matrix <inline-formula id="ieqn-232"><mml:math id="mml-ieqn-232"><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula> can be obtained directly as a product of Jacobian matrices from the chain rule just by putting square brackets around each factor: 
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi 
mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
</p>
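<p>The passage from the component form in Eq. (<xref ref-type="disp-formula" rid="eqn-13">13</xref>) to the matrix product in Eq. (<xref ref-type="disp-formula" rid="eqn-14">14</xref>) can be checked numerically. The sketch below uses plain Python and three small maps that are purely hypothetical (chosen here for illustration, with <italic>p</italic> = 2, <italic>n</italic> = 3, <italic>m</italic> = 2, <italic>l</italic> = 1); it multiplies the analytic Jacobian matrices in the order of Eq. (<xref ref-type="disp-formula" rid="eqn-14">14</xref>) and compares the result with central finite differences, which serve only as an independent check.</p>
<code language="python">
```python
import math


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical maps chosen only for illustration (p = 2, n = 3, m = 2, l = 1).
def x_of(theta):
    t1, t2 = theta
    return [t1 * t2, t1 + t2, t2 ** 2]


def y_of(x):
    x1, x2, x3 = x
    return [x1 + x2, x2 * x3]


def z_of(y):
    y1, y2 = y
    return [y1 * y2]


# Analytic Jacobians: row index = output component, column index = input component.
def dx_dtheta(theta):  # n x p
    t1, t2 = theta
    return [[t2, t1], [1.0, 1.0], [0.0, 2.0 * t2]]


def dy_dx(x):  # m x n
    x1, x2, x3 = x
    return [[1.0, 1.0, 0.0], [0.0, x3, x2]]


def dz_dy(y):  # l x m
    y1, y2 = y
    return [[y2, y1]]


def jacobian_by_chain_rule(theta):
    """Matrix product [dz/dy][dy/dx][dx/dtheta], an l x p matrix."""
    x = x_of(theta)
    y = y_of(x)
    return matmul(matmul(dz_dy(y), dy_dx(x)), dx_dtheta(theta))


def jacobian_by_differences(theta, h=1e-6):
    """Central differences of the composed map, used only as a check."""
    def z_total(t):
        return z_of(y_of(x_of(t)))

    rows = len(z_total(theta))
    J = [[0.0] * len(theta) for _ in range(rows)]
    for j in range(len(theta)):
        plus = list(theta)
        plus[j] += h
        minus = list(theta)
        minus[j] -= h
        zp, zm = z_total(plus), z_total(minus)
        for i in range(rows):
            J[i][j] = (zp[i] - zm[i]) / (2.0 * h)
    return J


theta0 = [0.5, -1.5]
J_chain = jacobian_by_chain_rule(theta0)
J_check = jacobian_by_differences(theta0)
assert all(math.isclose(a, b, rel_tol=1e-5, abs_tol=1e-7)
           for ra, rb in zip(J_chain, J_check) for a, b in zip(ra, rb))
```
</code>
<p>Because the row index of each Jacobian matrix is the output index and the column index is the input index, the repeated indices <italic>r</italic> and <italic>s</italic> are exactly the inner dimensions of the matrix products, so no transposes are needed.</p>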
<p>Consider the scalar function <inline-formula id="ieqn-233"><mml:math id="mml-ieqn-233"><mml:mi>E</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>m</mml:mi></mml:msup><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> that maps the column matrix <inline-formula id="ieqn-234"><mml:math id="mml-ieqn-234"><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>m</mml:mi></mml:msup></mml:math></inline-formula> into a scalar. The components of the gradient of <inline-formula id="ieqn-235"><mml:math id="mml-ieqn-235"><mml:mi>E</mml:mi></mml:math></inline-formula> with respect to <inline-formula id="ieqn-236"><mml:math id="mml-ieqn-236"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula> are then arranged in a row matrix defined as follows: 
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mi>E</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
with <inline-formula id="ieqn-237"><mml:math id="mml-ieqn-237"><mml:msubsup><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mi>E</mml:mi></mml:math></inline-formula> being the transpose of the <inline-formula id="ieqn-238"><mml:math id="mml-ieqn-238"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> column matrix <inline-formula id="ieqn-239"><mml:math id="mml-ieqn-239"><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mi>E</mml:mi></mml:math></inline-formula> containing these same components.<xref ref-type="fn" rid="fn56"><sup>56</sup></xref><fn id="fn56"><label>56</label><p>In [<xref ref-type="bibr" rid="ref-78">78</xref>], the column matrix (which is called &#x201C;vector&#x201D;) <inline-formula id="ieqn-3021"><mml:math id="mml-ieqn-3021"><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mi>E</mml:mi></mml:math></inline-formula> is referred to as the gradient of <inline-formula id="ieqn-3022"><mml:math id="mml-ieqn-3022"><mml:mi>E</mml:mi></mml:math></inline-formula>. 
Later on in the present paper, <inline-formula id="ieqn-3023"><mml:math id="mml-ieqn-3023"><mml:mi>E</mml:mi></mml:math></inline-formula> will be called the error or &#x201C;loss&#x201D; function, <inline-formula id="ieqn-3024"><mml:math id="mml-ieqn-3024"><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:math></inline-formula> the outputs of a neural network, and the gradient of <inline-formula id="ieqn-3025"><mml:math id="mml-ieqn-3025"><mml:mi>E</mml:mi></mml:math></inline-formula> with respect to <inline-formula id="ieqn-3026"><mml:math id="mml-ieqn-3026"><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:math></inline-formula> is the first step in the &#x201C;backpropagation&#x201D; algorithm in Section <xref ref-type="sec" rid="s5">5</xref> to find the gradient of <inline-formula id="ieqn-3027"><mml:math id="mml-ieqn-3027"><mml:mi>E</mml:mi></mml:math></inline-formula> with respect to the network parameters collected in the matrix <inline-formula id="ieqn-3028"><mml:math id="mml-ieqn-3028"><mml:mi mathvariant='bold-italic'>&#x03B8;</mml:mi></mml:math></inline-formula> for an optimization descent direction to minimize <inline-formula id="ieqn-3029"><mml:math id="mml-ieqn-3029"><mml:mi>E</mml:mi></mml:math></inline-formula>.</p></fn></p>
<p>Now consider this particular scalar function below:<xref ref-type="fn" rid="fn57"><sup>57</sup></xref><fn id="fn57"><label>57</label><p>Soon, it will be seen in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>) that the function <inline-formula id="ieqn-3030"><mml:math id="mml-ieqn-3030"><mml:mi>z</mml:mi></mml:math></inline-formula> is a linear combination of the network inputs <inline-formula id="ieqn-3031"><mml:math id="mml-ieqn-3031"><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:math></inline-formula>, which are outputs coming from the previous network layer, with <inline-formula id="ieqn-3032"><mml:math id="mml-ieqn-3032"><mml:mi mathvariant='bold-italic'>w</mml:mi></mml:math></inline-formula> being the weights. An advantage of defining <inline-formula id="ieqn-3033"><mml:math id="mml-ieqn-3033"><mml:mi mathvariant='bold-italic'>w</mml:mi></mml:math></inline-formula> as a row matrix, instead of a column matrix like <inline-formula id="ieqn-3034"><mml:math id="mml-ieqn-3034"><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:math></inline-formula>, is to de-clutter the equations by dispensing with (1) the superscript <inline-formula id="ieqn-3035"><mml:math id="mml-ieqn-3035"><mml:mi>T</mml:mi></mml:math></inline-formula> designating the transpose as in <inline-formula id="ieqn-3036"><mml:math id="mml-ieqn-3036"><mml:msup><mml:mi mathvariant='bold-italic'>w</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula>, or (2) the dot product symbol as in <inline-formula id="ieqn-3037"><mml:math id="mml-ieqn-3037"><mml:mi mathvariant='bold-italic'>w</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula>, leaving space for other indices, such as in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>).</p></fn> 
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Then the gradients of <inline-formula id="ieqn-240"><mml:math id="mml-ieqn-240"><mml:mi>z</mml:mi></mml:math></inline-formula> are<xref ref-type="fn" rid="fn58"><sup>58</sup></xref><fn id="fn58"><label>58</label><p>The gradients of <inline-formula id="ieqn-3038"><mml:math id="mml-ieqn-3038"><mml:mi>z</mml:mi></mml:math></inline-formula> will be used in the backpropagation algorithm in Section <xref ref-type="sec" rid="s5">5</xref> to obtain the gradient of the error (or loss) function <inline-formula id="ieqn-3039"><mml:math id="mml-ieqn-3039"><mml:mi>E</mml:mi></mml:math></inline-formula> to find the optimal weights that minimize <inline-formula id="ieqn-3040"><mml:math id="mml-ieqn-3040"><mml:mi>E</mml:mi></mml:math></inline-formula>.</p></fn> 
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">&#x27F9;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi 
mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
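The two gradients in Eq. (<xref ref-type="disp-formula" rid="eqn-17">17</xref>) can be checked numerically. The sketch below is only an illustration under stated assumptions: the sample values of <italic>w</italic> and <italic>y</italic> and the helper names <monospace>z_of</monospace> and <monospace>central_difference</monospace> are hypothetical and do not come from the present paper. It approximates the partial derivatives of <italic>z</italic> by central differences and compares them with the transpose of <italic>y</italic> and with <italic>w</italic>.
<preformat preformat-type="code">
```python
import math

# Hypothetical sample values (not from the paper): w is a 1 x n row matrix,
# y is an n x 1 column matrix, both stored here as flat lists of length n.
w = [0.5, -1.0, 2.0]
y = [3.0, 4.0, -2.0]


def z_of(w_row, y_col):
    """Scalar z = w y of Eq. (16): the sum over j of w_j * y_j."""
    return sum(wj * yj for wj, yj in zip(w_row, y_col))


def central_difference(func, point, step=1e-6):
    """Approximate every partial derivative of func at point, one component at a time."""
    grad = []
    for j in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[j] += step
        minus[j] -= step
        grad.append((func(plus) - func(minus)) / (2.0 * step))
    return grad


# Gradient of z with respect to w (y held fixed): Eq. (17) predicts y^T.
dz_dw = central_difference(lambda w_row: z_of(w_row, y), w)
# Gradient of z with respect to y (w held fixed): Eq. (17) predicts w.
dz_dy = central_difference(lambda y_col: z_of(w, y_col), y)

# Compare both numerical gradients with the closed-form results of Eq. (17).
matches_eq17 = all(
    math.isclose(a, b, rel_tol=1e-6, abs_tol=1e-6)
    for a, b in zip(dz_dw + dz_dy, y + w)
)
print(matches_eq17)
```
</preformat>
Because <italic>z</italic> is linear in each argument, the central differences agree with the closed-form gradients up to rounding error, so a loose tolerance suffices for the comparison.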
</p></sec>
<sec id="s4_3"><label>4.3</label>
<title>Big picture, composition of concepts</title>
<p>A fully-connected feedforward network is a chain of successive applications of functions <inline-formula id="ieqn-241"><mml:math id="mml-ieqn-241"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> with <inline-formula id="ieqn-242"><mml:math id="mml-ieqn-242"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi></mml:math></inline-formula>, one after another&#x2014;with <inline-formula id="ieqn-243"><mml:math id="mml-ieqn-243"><mml:mi>L</mml:mi></mml:math></inline-formula> being the number of &#x201C;layers&#x201D; or the <italic>depth</italic> of the network&#x2014;on the input <inline-formula id="ieqn-244"><mml:math id="mml-ieqn-244"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> to produce the predicted output <inline-formula id="ieqn-245"><mml:math id="mml-ieqn-245"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> for the target output <inline-formula id="ieqn-246"><mml:math id="mml-ieqn-246"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula>: 
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2218;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2218;</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>&#x2218;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2218;</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>&#x2218;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2218;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
or breaking Eq. (<xref ref-type="disp-formula" rid="eqn-18">18</xref>) down, step by step, from inputs to outputs:<xref ref-type="fn" rid="fn59"><sup>59</sup></xref><fn id="fn59"><label>59</label><p>To lighten the notation, the predicted output <inline-formula id="ieqn-3041"><mml:math id="mml-ieqn-3041"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> from layer <inline-formula id="ieqn-3042"><mml:math id="mml-ieqn-3042"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is indicated by the superscript <inline-formula id="ieqn-3043"><mml:math id="mml-ieqn-3043"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, without the tilde. The output <inline-formula id="ieqn-3044"><mml:math id="mml-ieqn-3044"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> from the last layer <inline-formula id="ieqn-3045"><mml:math id="mml-ieqn-3045"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the network predicted output <inline-formula id="ieqn-3046"><mml:math id="mml-ieqn-3046"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula>.</p></fn> 
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mtext>&#xA0;(inputs)&#xA0;</mml:mtext></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:msup><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mtext>&#xA0;(predicted outputs)&#xA0;</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
</p>
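<p>The equivalence between the composed form in Eq. (<xref ref-type="disp-formula" rid="eqn-18">18</xref>) and the step-by-step form in Eq. (<xref ref-type="disp-formula" rid="eqn-19">19</xref>) can be illustrated with a small sketch. The three scalar functions below are hypothetical placeholders chosen only so that the arithmetic is easy to follow; they are not the layer functions of the present paper, and the helper names <monospace>compose</monospace> and <monospace>forward_steps</monospace> are likewise assumptions.</p>
<preformat preformat-type="code">
```python
from functools import reduce

# Hypothetical scalar stand-ins for the layer functions f^(1), f^(2), f^(3);
# they are NOT the layer functions of the paper, only placeholders with L = 3.
layers = [
    lambda v: v + 1.0,   # f^(1)
    lambda v: 2.0 * v,   # f^(2)
    lambda v: v * v,     # f^(3)
]


def compose(functions):
    """Return f^(L) o ... o f^(1) as in Eq. (18); functions[0] is applied first."""
    return reduce(lambda g, f: (lambda v: f(g(v))), functions, lambda v: v)


def forward_steps(functions, x):
    """Step-by-step form of Eq. (19): y^(0) = x, then y^(l) = f^(l)(y^(l-1))."""
    outputs = [x]                       # outputs[0] is y^(0) = x
    for f in functions:
        outputs.append(f(outputs[-1]))  # x^(l) = y^(l-1) is fed into f^(l)
    return outputs                      # outputs[l] is y^(l); the last entry is the prediction


x = 3.0
y_tilde = compose(layers)(x)   # single composed function applied to x
ys = forward_steps(layers, x)  # all intermediate layer outputs
print(y_tilde, ys)
```
</preformat>
<p>With these placeholder functions, the input 3.0 is mapped to 4.0, then 8.0, then 64.0, so both forms return the same predicted output. Keeping the intermediate outputs, as <monospace>forward_steps</monospace> does, is what the step-by-step form adds over the composed form.</p>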
<statement id="st4_1"><title><xref ref-type="statement" rid="st4_1">Remark 4.1</xref>.</title>
<p>The notation <inline-formula id="ieqn-247"><mml:math id="mml-ieqn-247"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, for <inline-formula id="ieqn-248"><mml:math id="mml-ieqn-248"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-19">19</xref>), will be useful in developing a concise formulation for computing, by backpropagation, the gradient of the cost function with respect to the parameters for use in training (finding optimal parameters); see Eqs. (<xref ref-type="disp-formula" rid="eqn-91">91</xref>)-(<xref ref-type="disp-formula" rid="eqn-92">92</xref>) in Section <xref ref-type="sec" rid="s5">5</xref> on Backpropagation.</p></statement>
<p>The quantities associated with layer <inline-formula id="ieqn-249"><mml:math id="mml-ieqn-249"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in a network are indicated with the superscript <inline-formula id="ieqn-250"><mml:math id="mml-ieqn-250"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, so that the inputs to layer <inline-formula id="ieqn-251"><mml:math id="mml-ieqn-251"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as gathered in the <inline-formula id="ieqn-252"><mml:math id="mml-ieqn-252"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> column matrix 
<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>are the <italic>predicted</italic> outputs from the previous layer <inline-formula id="ieqn-253"><mml:math id="mml-ieqn-253"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, gathered in the matrix <inline-formula id="ieqn-254"><mml:math id="mml-ieqn-254"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-255"><mml:math id="mml-ieqn-255"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> being the <italic>width</italic> of layer <inline-formula id="ieqn-256"><mml:math id="mml-ieqn-256"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. Similarly, the outputs of layer <inline-formula id="ieqn-257"><mml:math id="mml-ieqn-257"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as gathered in the <inline-formula id="ieqn-258"><mml:math id="mml-ieqn-258"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> matrix 
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
are the inputs to the subsequent layer <inline-formula id="ieqn-259"><mml:math id="mml-ieqn-259"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, gathered in the matrix <inline-formula id="ieqn-260"><mml:math id="mml-ieqn-260"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-261"><mml:math id="mml-ieqn-261"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> being the <italic>width</italic> of layer <inline-formula id="ieqn-262"><mml:math id="mml-ieqn-262"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p> 
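<p>The bookkeeping of layer widths in Eqs. (<xref ref-type="disp-formula" rid="eqn-20">20</xref>)-(<xref ref-type="disp-formula" rid="eqn-21">21</xref>) can be traced with a short sketch. Everything specific in it is an assumption made for illustration: the widths, the random weights, and the placeholder layer form (a linear map followed by tanh) are hypothetical and are not taken from the present paper. The sketch only records the sizes of the input and output column matrices of each layer.</p>
<preformat preformat-type="code">
```python
import math
import random

random.seed(0)

# Hypothetical widths m_(0), ..., m_(L) with L = 3; m_(0) is the number of network inputs.
widths = [4, 5, 3, 2]


def make_layer(m_in, m_out):
    """Placeholder f^(l): a random m_out x m_in linear map followed by tanh.

    This specific form is an assumption made for illustration only.
    """
    weights = [[random.uniform(-1.0, 1.0) for _ in range(m_in)] for _ in range(m_out)]

    def layer(x_col):
        # Each output component combines all components of the input column.
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x_col))) for row in weights]

    return layer


layers = [make_layer(widths[ell - 1], widths[ell]) for ell in range(1, len(widths))]

y_prev = [1.0] * widths[0]          # y^(0) = x, a column of length m_(0)
shapes = []
for f in layers:
    x_ell = y_prev                  # x^(l) = y^(l-1), of length m_(l-1), Eq. (20)
    y_ell = f(x_ell)                # y^(l) = x^(l+1), of length m_(l), Eq. (21)
    shapes.append((len(x_ell), len(y_ell)))
    y_prev = y_ell

print(shapes)  # [(4, 5), (5, 3), (3, 2)]
```
</preformat>
<p>Each recorded pair holds the input size and the output size of one layer, and the output size of a layer always equals the input size of the next one.</p>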
<statement id="st4_2"><title><xref ref-type="statement" rid="st4_2">Remark 4.2</xref>.</title>
<p>The output for layer <inline-formula id="ieqn-263"><mml:math id="mml-ieqn-263"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, denoted by <inline-formula id="ieqn-264"><mml:math id="mml-ieqn-264"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, can also be written as <inline-formula id="ieqn-265"><mml:math id="mml-ieqn-265"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, where &#x201C;<inline-formula id="ieqn-266"><mml:math id="mml-ieqn-266"><mml:mi>h</mml:mi></mml:math></inline-formula>&#x201D; is mnemonic for &#x201C;hidden&#x201D;, since the inner layers between the input layer <inline-formula id="ieqn-267"><mml:math id="mml-ieqn-267"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and the output layer <inline-formula id="ieqn-268"><mml:math id="mml-ieqn-268"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> are considered as being &#x201C;hidden&#x201D;. Both notations <inline-formula id="ieqn-269"><mml:math id="mml-ieqn-269"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-270"><mml:math id="mml-ieqn-270"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are equivalent</p>
<p><disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and can be used interchangeably. In the current Section <xref ref-type="sec" rid="s4">4</xref> on &#x201C;Static, feedforward networks&#x201D;, the notation <inline-formula id="ieqn-271"><mml:math id="mml-ieqn-271"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is used, whereas in Section <xref ref-type="sec" rid="s7">7</xref> on &#x201C;Dynamics, sequential data, sequence modeling&#x201D;, the notation <inline-formula id="ieqn-272"><mml:math id="mml-ieqn-272"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is used to designate the output of the &#x201C;hidden cell&#x201D; at state <inline-formula id="ieqn-273"><mml:math id="mml-ieqn-273"><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> in a recurrent neural network, keeping in mind the equivalence in Eq. (<xref ref-type="disp-formula" rid="eqn-276">276</xref>) in Remark <xref ref-type="statement" rid="st7_1">7.1</xref>. Whenever necessary, readers are reminded of the equivalence in Eq. (<xref ref-type="disp-formula" rid="eqn-22">22</xref>) to avoid possible confusion when reading deep-learning literature.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement><p>The above chain in Eq. (<xref ref-type="disp-formula" rid="eqn-18">18</xref>)&#x2014;see also Eq. 
(<xref ref-type="disp-formula" rid="eqn-23">23</xref>) and Figure <xref ref-type="fig" rid="fig-23">23</xref>&#x2014;is referred to as &#x201C;multiple levels of composition&#x201D; that characterizes modern deep learning, which no longer attempts to mimic the working of the brain from the neuroscientific perspective.<xref ref-type="fn" rid="fn60"><sup>60</sup></xref><fn id="fn60"><label>60</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 14 and p. 163.</p></fn> Besides, a complete understanding of how the brain functions is still remote.<xref ref-type="fn" rid="fn61"><sup>61</sup></xref><fn id="fn61"><label>61</label><p>In the review paper [<xref ref-type="bibr" rid="ref-12">12</xref>], addressed to computer-science experts and dense with acronyms and jargon &#x201C;foreign&#x201D; to first-time learners, the authors mentioned &#x201C;It is ironic that artificial NNs [neural networks] (ANNs) can help to better understand biological NNs (BNNs)&#x201D;, and cited a 2012 paper that won an &#x201C;image segmentation&#x201D; contest in helping to construct a 3-D model of the &#x201C;brain&#x2019;s neurons and dendrites&#x201D; from &#x201C;electron microscopy images of stacks of thin slices of animal brains&#x201D;.</p></fn></p>
<sec id="s4_3_1"><label>4.3.1</label>
<title>Graphical representation, block diagrams</title>
<p>A function can be graphically represented as in Figure <xref ref-type="fig" rid="fig-22">22</xref>.</p>
<fig id="fig-22">
<label>Figure 22</label>
<caption><title><italic>Function mapping, graphical representation</italic> (Section <xref ref-type="sec" rid="s4_3_1">4.3.1</xref>): <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mi>n</mml:mi></mml:math></inline-formula> inputs in <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> (<inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> column matrix of real numbers) are fed into function <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>f</mml:mi></mml:math></inline-formula> to produce <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>m</mml:mi></mml:math></inline-formula> outputs in <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-22.tif"/>
</fig>
<p>The multiple levels of composition in Eq. (<xref ref-type="disp-formula" rid="eqn-18">18</xref>) can then be represented by</p>
<p><disp-formula id="eqn-23"><label>(23)</label><mml:math id="mml-eqn-23" display="block"><mml:mrow><mml:munder><mml:munder><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy='true'>&#xFE38;</mml:mo></mml:munder><mml:mrow><mml:mtext>Input</mml:mtext></mml:mrow></mml:munder><mml:mspace width="1em" /><mml:munder><mml:munder><mml:mrow><mml:mover><mml:mo>&#x2192;</mml:mo><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mover><mml:mspace width="0.5em" /><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mspace width="0.5em" /><mml:mover><mml:mo>&#x2192;</mml:mo><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mover><mml:mspace width="0.5em" /><mml:mo>&#x22EF;</mml:mo><mml:mspace width="0.5em" /><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mspace width="0.5em" /><mml:mover><mml:mo>&#x2192;</mml:mo><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mover><mml:mspace width="0.5em" /><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mspace 
width="0.5em" /><mml:mo>&#x22EF;</mml:mo><mml:mspace width="0.5em" /><mml:mover><mml:mo>&#x2192;</mml:mo><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mover><mml:mspace width="0.5em" /><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mspace width="0.5em" /><mml:mover><mml:mo>&#x2192;</mml:mo><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mover></mml:mrow><mml:mo stretchy='true'>&#xFE38;</mml:mo></mml:munder><mml:mrow><mml:mtext>Network&#x00A0;as&#x00A0;multilevel&#x00A0;composition&#x00A0;of&#x00A0;functions</mml:mtext></mml:mrow></mml:munder><mml:mspace width="1em" /><mml:munder><mml:munder><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy='true'>&#xFE38;</mml:mo></mml:munder><mml:mrow><mml:mtext>Output</mml:mtext></mml:mrow></mml:munder></mml:mrow></mml:math></disp-formula></p>
<p>revealing the structure of the <italic>feedforward network</italic> as a multilevel composition of functions (or chain-based network) in which the output <inline-formula id="ieqn-274"><mml:math id="mml-ieqn-274"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of the previous layer <inline-formula id="ieqn-275"><mml:math id="mml-ieqn-275"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> serves as the input for the current layer <inline-formula id="ieqn-276"><mml:math id="mml-ieqn-276"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, to be processed by the function <inline-formula id="ieqn-277"><mml:math id="mml-ieqn-277"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> to produce the output <inline-formula id="ieqn-278"><mml:math id="mml-ieqn-278"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. 
The input <inline-formula id="ieqn-279"><mml:math id="mml-ieqn-279"><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> for the first layer <inline-formula id="ieqn-280"><mml:math id="mml-ieqn-280"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the input for the entire network. The output <inline-formula id="ieqn-281"><mml:math id="mml-ieqn-281"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> of the (last) layer <inline-formula id="ieqn-282"><mml:math id="mml-ieqn-282"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the predicted output for the entire network.</p>
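<p>The chain structure of Eq. (<xref ref-type="disp-formula" rid="eqn-23">23</xref>) can be illustrated with a minimal sketch in plain Python. The helper names <italic>compose_layers</italic> and <italic>forward_states</italic>, and the three toy layer functions, are hypothetical and chosen only for illustration; they are not taken from any reference cited here.</p>

```python
from functools import reduce


def compose_layers(layers):
    """Build the network y^(L) = f^(L)(...f^(1)(x)...) from [f^(1), ..., f^(L)]."""
    def network(x):
        # Start from y^(0) = x and feed each output into the next layer.
        return reduce(lambda y, f: f(y), layers, x)
    return network


def forward_states(layers, x):
    """Return every intermediate result [y^(0), y^(1), ..., y^(L)]."""
    states = [x]
    for f in layers:
        states.append(f(states[-1]))
    return states


# Hypothetical layer functions, for illustration only.
def f1(y):
    return [2.0 * v for v in y]   # scale each input


def f2(y):
    return [v + 1.0 for v in y]   # shift each input


def f3(y):
    return [sum(y)]               # reduce two values to one output


layers = [f1, f2, f3]             # L = 3 layers
network = compose_layers(layers)
x = [1.0, 2.0]                    # input x = y^(0), n = 2
y_tilde = network(x)              # predicted output y^(L), m = 1
states = forward_states(layers, x)
print(y_tilde)
print(len(states))
```

<p>In this sketch, the list returned by <italic>forward_states</italic> holds one entry more than the number of layer functions, since it includes the input in addition to the output of each layer.</p>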
<fig id="fig-23">
<label>Figure 23</label>
<caption><title><italic>Feedforward network</italic> (Sections <xref ref-type="sec" rid="s4_3_1">4.3.1</xref>, <xref ref-type="sec" rid="s4_4_4">4.4.4</xref>): Multilevel composition in feedforward network with <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>L</mml:mi></mml:math></inline-formula> layers represented as a sequential application of functions <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi></mml:math></inline-formula>, to <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>n</mml:mi></mml:math></inline-formula> inputs gathered in <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> (<inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> column matrix of real numbers) to produce <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>m</mml:mi></mml:math></inline-formula> outputs gathered in <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. This figure is a higher-level block diagram that corresponds to the lower-level neural network in Figure <xref ref-type="fig" rid="fig-7">7</xref> or in Figure <xref ref-type="fig" rid="fig-35">35</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-23.tif"/>
</fig>
<statement id="st4_3"><title><xref ref-type="statement" rid="st4_3">Remark 4.3</xref>.</title>
<p><italic>Layer definitions, action layers, state layers.</italic> In Eq. (<xref ref-type="disp-formula" rid="eqn-23">23</xref>) and in Figure <xref ref-type="fig" rid="fig-23">23</xref>, an <italic>action layer</italic> is defined by the action, i.e., the function <inline-formula id="ieqn-283"><mml:math id="mml-ieqn-283"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, on the inputs <inline-formula id="ieqn-284"><mml:math id="mml-ieqn-284"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> to produce the outputs <inline-formula id="ieqn-285"><mml:math id="mml-ieqn-285"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. There are <inline-formula id="ieqn-286"><mml:math id="mml-ieqn-286"><mml:mi>L</mml:mi></mml:math></inline-formula> action layers. 
A <italic>state layer</italic> is a collection of inputs or outputs, i.e., <inline-formula id="ieqn-287"><mml:math id="mml-ieqn-287"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi></mml:math></inline-formula>, each describing a state of the system; hence the number of state layers is <inline-formula id="ieqn-288"><mml:math id="mml-ieqn-288"><mml:mi>L</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, and the number of hidden (state) layers (excluding the input layer <inline-formula id="ieqn-289"><mml:math id="mml-ieqn-289"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the output layer <inline-formula id="ieqn-290"><mml:math id="mml-ieqn-290"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>) is <inline-formula id="ieqn-291"><mml:math id="mml-ieqn-291"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. For an illustration of state layers, see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 6, Figure 1.2. See also Remark <xref ref-type="statement" rid="st11_3">11.3</xref>. From here on, &#x201C;hidden layer&#x201D; means &#x201C;hidden <italic>state</italic> layer&#x201D;, agreeing with the terminology in [<xref ref-type="bibr" rid="ref-78">78</xref>].
See also Remark <xref ref-type="statement" rid="st4_5">4.5</xref> on depth definitions in Section <xref ref-type="sec" rid="s4_6_1">4.6.1</xref> on &#x201C;Depth, size&#x201D;.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement></sec>
</sec>
<sec id="s4_4"><label>4.4</label>
<title>Network layer, detailed construct</title>
<sec id="s4_4_1"><label>4.4.1</label>
<title>Linear combination of inputs and biases</title>
<p>First, an affine transformation on the inputs (see Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>)) is carried out, in which the coefficients of the inputs are called the weights, and the constants are called the biases. The output <inline-formula id="ieqn-292"><mml:math id="mml-ieqn-292"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of layer <inline-formula id="ieqn-293"><mml:math id="mml-ieqn-293"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the input to layer <inline-formula id="ieqn-294"><mml:math id="mml-ieqn-294"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula></p>
<p><disp-formula id="eqn-24"><label>(24)</label><mml:math id="mml-eqn-24" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The column matrix <inline-formula id="ieqn-295"><mml:math id="mml-ieqn-295"><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></p>
<p><disp-formula id="eqn-25"><label>(25)</label><mml:math id="mml-eqn-25" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>is a linear combination of the inputs in <inline-formula id="ieqn-296"><mml:math id="mml-ieqn-296"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> plus the biases (i.e., an affine transformation)<xref ref-type="fn" rid="fn62"><sup>62</sup></xref><fn id="fn62"><label>62</label><p>See Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>) for the continuous temporal summation, counterpart of the discrete spatial summation in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>).</p></fn></p>
<p><disp-formula id="eqn-26"><label>(26)</label><mml:math id="mml-eqn-26" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#xA0;such that&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>b</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the <inline-formula id="ieqn-297"><mml:math id="mml-ieqn-297"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> matrix <inline-formula id="ieqn-298"><mml:math id="mml-ieqn-298"><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> contains the weights<xref ref-type="fn" rid="fn63"><sup>63</sup></xref><fn id="fn63"><label>63</label><p>The use of both <inline-formula id="ieqn-3047"><mml:math id="mml-ieqn-3047"><mml:mi mathvariant='bold-italic'>W</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-3048"><mml:math id="mml-ieqn-3048"><mml:msup><mml:mi mathvariant='bold-italic'>W</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-78">78</xref>] in equations equivalent to Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>) is confusing. For example, on p. 205, in Section <xref ref-type="sec" rid="s6_5_4">6.5.4</xref> on backpropagation for fully-connected feedforward networks, Algorithm 6.3, an equation that uses <inline-formula id="ieqn-3049"><mml:math id="mml-ieqn-3049"><mml:mi mathvariant='bold-italic'>W</mml:mi></mml:math></inline-formula> in the same manner as Eq.
(<xref ref-type="disp-formula" rid="eqn-26">26</xref>) is <inline-formula id="ieqn-3050"><mml:math id="mml-ieqn-3050"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:math></inline-formula>, whereas on p. 191, in Section <xref ref-type="sec" rid="s6_4">6.4</xref>, Architecture Design, Eq. 
(6.40) uses <inline-formula id="ieqn-3051"><mml:math id="mml-ieqn-3051"><mml:msup><mml:mi mathvariant='bold-italic'>W</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> and reads as <inline-formula id="ieqn-3052"><mml:math id="mml-ieqn-3052"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi>G</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which is similar to Eq. (6.36) on p. 187. On the other hand, both <inline-formula id="ieqn-3053"><mml:math id="mml-ieqn-3053"><mml:mi mathvariant='bold-italic'>W</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-3054"><mml:math id="mml-ieqn-3054"><mml:msup><mml:mi mathvariant='bold-italic'>W</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> appear on the same p. 
190 in the expressions <inline-formula id="ieqn-3055"><mml:math id="mml-ieqn-3055"><mml:mtext>cos</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant='bold-italic'>W</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-3056"><mml:math id="mml-ieqn-3056"><mml:mrow><mml:mrow><mml:mi mathvariant='bold-italic'>h</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant='bold-italic'>W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. Here, we stick to a single definition of <inline-formula id="ieqn-3057"><mml:math id="mml-ieqn-3057"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant='bold-italic'>W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:math></inline-formula> as defined in Eq. (<xref ref-type="disp-formula" rid="eqn-27">27</xref>) and used in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>).</p></fn></p>
<p><disp-formula id="eqn-27"><label>(27)</label><mml:math id="mml-eqn-27" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo
stretchy="false">[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x22EE;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and the <inline-formula id="ieqn-299"><mml:math id="mml-ieqn-299"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> column matrix <inline-formula id="ieqn-300"><mml:math id="mml-ieqn-300"><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> the biases:<xref ref-type="fn" rid="fn64"><sup>64</sup></xref><fn id="fn64"><label>64</label><p>Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>) is a linear (additive) combination of inputs with possibly non-zero biases. An additive combination of inputs with zero bias, and a &#x201C;multiplicative&#x201D; combination of inputs of the form <inline-formula id="ieqn-3058"><mml:math id="mml-ieqn-3058"><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> with zero bias, were
mentioned in [<xref ref-type="bibr" rid="ref-12">12</xref>]. In [<xref ref-type="bibr" rid="ref-112">112</xref>], the author went even further to propose the general case in which <inline-formula id="ieqn-3059"><mml:math id="mml-ieqn-3059"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>k</mml:mi><mml:mo>&lt;</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-3060"><mml:math id="mml-ieqn-3060"><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> is <italic>any</italic> differentiable function. But it is not clear whether any of these more complex functions of the inputs were used in practice, as we have not seen any such use, e.g., in [<xref ref-type="bibr" rid="ref-21">21</xref>] [<xref ref-type="bibr" rid="ref-78">78</xref>], and many other articles, including review articles such as [<xref ref-type="bibr" rid="ref-13">13</xref>] [<xref ref-type="bibr" rid="ref-20">20</xref>]. On the other hand, the additive combination has a clear theoretical foundation as the linear-order approximation to the Volterra series Eq. 
(<xref ref-type="disp-formula" rid="eqn-496">496</xref>); see Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>) and also [<xref ref-type="bibr" rid="ref-19">19</xref>].</p></fn></p>
<p><disp-formula id="eqn-28"><label>(28)</label><mml:math id="mml-eqn-28" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mtext>.</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Both the weights and the biases are collectively known as the network parameters, defined in the following matrices for layer <inline-formula id="ieqn-301"><mml:math id="mml-ieqn-301"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-29"><label>(29)</label><mml:math id="mml-eqn-29" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-30"><label>(30)</label><mml:math id="mml-eqn-30" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x22EE;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mtext>&#xA0;</mml:mtext><mml:mo>|</mml:mo></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>For simplicity and convenience, the set of all parameters in the network is denoted by <inline-formula id="ieqn-302"><mml:math id="mml-ieqn-302"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula>, and the set of all parameters in layer <inline-formula id="ieqn-303"><mml:math id="mml-ieqn-303"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> by <inline-formula id="ieqn-304"><mml:math id="mml-ieqn-304"><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>:<xref ref-type="fn" rid="fn65"><sup>65</sup></xref><fn id="fn65"><label>65</label><p>For convenience in further reading, wherever possible, we use the same notation as in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. xix.</p></fn></p>
<p><disp-formula id="eqn-31"><label>(31)</label><mml:math id="mml-eqn-31" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;such 
that&#xA0;</mml:mtext></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Note that the set <inline-formula id="ieqn-305"><mml:math id="mml-ieqn-305"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-31">31</xref>) is not a matrix, but a set of matrices, since the number of rows <inline-formula id="ieqn-306"><mml:math id="mml-ieqn-306"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> for a layer <inline-formula id="ieqn-307"><mml:math id="mml-ieqn-307"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> may vary for different values of <inline-formula id="ieqn-308"><mml:math id="mml-ieqn-308"><mml:mi>&#x2113;</mml:mi></mml:math></inline-formula>, even though in practice, the widths of the layers in a fully connected feed-forward network may generally be chosen to be the same.</p>
<p>Similar to the definition of the parameter matrix <inline-formula id="ieqn-309"><mml:math id="mml-ieqn-309"><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-30">30</xref>), which includes the biases <inline-formula id="ieqn-310"><mml:math id="mml-ieqn-310"><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, it is convenient for use later in elucidating the backpropagation method in Section <xref ref-type="sec" rid="s5">5</xref> (and Section <xref ref-type="sec" rid="s5_2">5.2</xref> in particular) to expand the matrix <inline-formula id="ieqn-311"><mml:math id="mml-ieqn-311"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>) into the matrix <inline-formula id="ieqn-312"><mml:math id="mml-ieqn-312"><mml:msup><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> (with an overbar) as follows:</p>
<p><disp-formula id="eqn-32"><label>(32)</label><mml:math id="mml-eqn-32" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2261;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mtext>&#xA0;</mml:mtext><mml:mo>|</mml:mo></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=:</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mtext>&#xA0;</mml:mtext><mml:msup><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with</p>
<p><disp-formula id="eqn-33"><label>(33)</label><mml:math id="mml-eqn-33" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>:=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mtext>&#xA0;</mml:mtext><mml:mo>|</mml:mo></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:msup><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>:=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
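<p>The equivalence in Eq. (<xref ref-type="disp-formula" rid="eqn-32">32</xref>) between the affine form and the augmented form can be checked numerically. The following is a minimal sketch in plain Python, not taken from the cited references; the function names <monospace>affine</monospace> and <monospace>affine_augmented</monospace>, as well as the numerical values, are hypothetical and chosen only for illustration.</p>
<p><code language="python" xml:space="preserve">
```python
# Illustrative sketch only: helper names and numbers are assumptions.

def affine(W, b, y_prev):
    """Affine form: z = W y_prev + b, with W stored as a list of rows."""
    return [sum(w_ij * y_j for w_ij, y_j in zip(row, y_prev)) + b_i
            for row, b_i in zip(W, b)]


def affine_augmented(W, b, y_prev):
    """Augmented form: z = theta y_bar, theta = [W | b], y_bar = [y_prev; 1]."""
    theta = [row + [b_i] for row, b_i in zip(W, b)]  # m_(l) rows, m_(l-1)+1 columns
    y_bar = y_prev + [1.0]                           # m_(l-1)+1 entries
    return [sum(t_ij * y_j for t_ij, y_j in zip(row, y_bar)) for row in theta]


# Hypothetical layer with m_(l-1) = 3 inputs and m_(l) = 2 outputs.
W = [[1.0, 2.0, 3.0],
     [0.5, -1.0, 0.0]]
b = [0.5, -2.0]
y_prev = [1.0, 0.0, -1.0]

z_plain = affine(W, b, y_prev)
z_aug = affine_augmented(W, b, y_prev)
```
</code></p>
<p>In this sketch both calls return the same column matrix, since appending the entry 1 to the layer input lets the last column of the parameter matrix play the role of the biases.</p>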
<p>The total number of parameters of a fully connected feed-forward network is then</p>
<p><disp-formula id="eqn-34"><label>(34)</label><mml:math id="mml-eqn-34" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub><mml:mo>:=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mi>L</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
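<p>As a worked instance of Eq. (<xref ref-type="disp-formula" rid="eqn-34">34</xref>), the short sketch below counts the parameters of a small network. It assumes that the first entry of the list of widths is the number of network inputs, i.e., the width of layer (0); the function name <monospace>count_parameters</monospace> and the chosen widths are hypothetical.</p>
<p><code language="python" xml:space="preserve">
```python
# Illustrative sketch only: the name and the widths are assumptions.

def count_parameters(widths):
    """Return P_T for widths = [m_(0), m_(1), ..., m_(L)].

    Each layer (l) contributes m_(l) * (m_(l-1) + 1) parameters:
    m_(l) * m_(l-1) weights plus m_(l) biases.
    """
    return sum(m_l * (m_prev + 1) for m_prev, m_l in zip(widths, widths[1:]))


# Hypothetical network: 2 inputs, two hidden layers of width 3, 1 output.
widths = [2, 3, 3, 1]
P_T = count_parameters(widths)  # 3*(2+1) + 3*(3+1) + 1*(3+1) = 25
```
</code></p>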
<p>But why use a linear (additive) combination (or superposition) of inputs with weights, plus biases, as expressed in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>)? See Section <xref ref-type="sec" rid="s13_2">13.2</xref>.</p></sec>
<sec id="s4_4_2"><label>4.4.2</label>
<title>Activation functions</title>
<p>An activation function <inline-formula id="ieqn-313"><mml:math id="mml-ieqn-313"><mml:mi>a</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula>, which is a nonlinear real-valued function, is used to decide when the information in its argument is relevant for a neuron to activate. In other words, an activation function filters out information deemed insignificant, and is applied <italic>element-wise</italic> to the matrix <inline-formula id="ieqn-314"><mml:math id="mml-ieqn-314"><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>), obtained as a linear combination of the inputs plus the biases:</p>
<p><disp-formula id="eqn-35"><label>(35)</label><mml:math id="mml-eqn-35" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#xA0;such that&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Without the activation function, the neural network is simply a linear regression, and cannot learn and perform complex tasks, such as image classification, language translation, guiding a driver-less car, etc. See Figure <xref ref-type="fig" rid="fig-32">32</xref> for the block diagram of a one-layer network.</p>
<p>An example is the inability of a linear one-layer network, without an activation function, to represent the seemingly simple XOR (exclusive-or) function, a limitation that brought down the first wave of AI (cybernetics) and that is described in Section <xref ref-type="sec" rid="s4_5">4.5</xref>.</p>
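<p>The statement that a network without activation functions reduces to a linear (affine) map can be illustrated with two stacked layers of the form in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>). The sketch below, in plain Python with hypothetical helper names and made-up weights, compares the two-layer output with that of a single layer whose weight matrix is the product of the two weight matrices and whose bias combines the two biases.</p>
<p><code language="python" xml:space="preserve">
```python
# Illustrative sketch only: helper names and numbers are assumptions.

def matvec(A, x):
    """Matrix-vector product, with A stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def matmul(A, B):
    """Matrix-matrix product A B, both stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    """Entry-wise sum of two vectors."""
    return [a + c for a, c in zip(u, v)]


W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [1.0, -1.0]
W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [0.0, 3.0]
x = [1.0, 2.0]

# Two layers applied in sequence, with no activation function in between.
y_two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One equivalent affine layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
y_one_layer = add(matvec(W, x), b)
```
</code></p>
<p>Here the two outputs coincide, so the extra layer adds no representational power; inserting a nonlinear activation between the two layers is what prevents this collapse.</p>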
<p><bold>Rectified linear units (ReLU).</bold> For the choice of activation function <inline-formula id="ieqn-315"><mml:math id="mml-ieqn-315"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, most modern <italic>large</italic> deep-learning networks use the default,<xref ref-type="fn" rid="fn66"><sup>66</sup></xref><fn id="fn66"><label>66</label><p>&#x201C;In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU,&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 168.</p></fn> well-proven <italic>rectified linear function</italic> (more often known as the &#x201C;positive part&#x201D; function) defined as<xref ref-type="fn" rid="fn67"><sup>67</sup></xref><fn id="fn67"><label>67</label><p>The notation <inline-formula id="ieqn-3061"><mml:math id="mml-ieqn-3061"><mml:msup><mml:mi>z</mml:mi><mml:mo>+</mml:mo></mml:msup></mml:math></inline-formula> for the positive part function is used in the mathematics literature, e.g., &#x201C;Positive and negative parts&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Positive_and_negative_parts&amp;oldid=830205996">version 12:11, 13 March 2018</ext-link>, and less frequently in the computer-science literature, e.g., [<xref ref-type="bibr" rid="ref-33">33</xref>]. The notation <inline-formula id="ieqn-3062"><mml:math id="mml-ieqn-3062"><mml:mo stretchy="false">[</mml:mo><mml:mi>z</mml:mi><mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo></mml:msub></mml:math></inline-formula> is found in the neuroscience literature, e.g., [<xref ref-type="bibr" rid="ref-32">32</xref>] [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 63. 
The notation <inline-formula id="ieqn-3063"><mml:math id="mml-ieqn-3063"><mml:mtext>max</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is more widely used in the computer-science literature, e.g., [<xref ref-type="bibr" rid="ref-34">34</xref>] [<xref ref-type="bibr" rid="ref-113">113</xref>], [<xref ref-type="bibr" rid="ref-78">78</xref>].</p></fn></p>
<p><disp-formula id="eqn-36"><label>(36)</label><mml:math id="mml-eqn-36" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mo>+</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>z</mml:mi><mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo></mml:msub><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>z</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>z</mml:mi></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mn>0</mml:mn><mml:mo>&lt;</mml:mo><mml:mi>z</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and depicted in Figure <xref ref-type="fig" rid="fig-24">24</xref>, for which the processing unit is called the <italic>rectified linear unit</italic> (ReLU),<xref ref-type="fn" rid="fn68"><sup>68</sup></xref><fn id="fn68"><label>68</label><p>A similar relation can be applied to define the Leaky ReLU in Eq. (<xref ref-type="disp-formula" rid="eqn-40">40</xref>).</p></fn> which was demonstrated to be superior to other activation functions in many problems.<xref ref-type="fn" rid="fn69"><sup>69</sup></xref><fn id="fn69"><label>69</label><p>In [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 15, the authors cited the original papers [<xref ref-type="bibr" rid="ref-33">33</xref>] and [<xref ref-type="bibr" rid="ref-34">34</xref>], where ReLU was introduced in the context of image / object recognition, and [<xref ref-type="bibr" rid="ref-113">113</xref>], where the superiority of ReLU over hyperbolic-tangent units and sigmoidal units was demonstrated.</p></fn> Therefore, in this section, we discuss in detail the rectified linear function, with careful explanation and motivation. It is important to note that ReLU is superior for <italic>large</italic> networks, and may have about the same, or lower, accuracy than the older logistic sigmoid function for &#x201C;very small&#x201D; networks, while requiring less computational effort.<xref ref-type="fn" rid="fn70"><sup>70</sup></xref><fn id="fn70"><label>70</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 219, and Section <xref ref-type="sec" rid="s13_3">13.3</xref> on the history of activation functions. See also Section <xref ref-type="sec" rid="s4_6">4.6</xref> for a discussion of network size. 
ReLU requires less computational effort because (1) it is an identity map for positive argument, (2) it is zero for negative argument, and (3) its first derivative is the Step (Heaviside) function, as shown in Figure <xref ref-type="fig" rid="fig-24">24</xref> and explained in Section <xref ref-type="sec" rid="s5">5</xref> on Backpropagation.</p></fn></p>
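<p>As a minimal sketch of Eq. (<xref ref-type="disp-formula" rid="eqn-36">36</xref>), the three notations for the rectified linear function can be written as three small Python functions and compared on a few sample values. The function names and the sample values are ours, chosen only for illustration; the closed form (z + |z|)/2 for the positive part is a standard identity that is not written out in the text above.</p>
<p><code language="python"><![CDATA[
```python
def relu_max(z: float) -> float:
    """ReLU written as max(0, z), the usual computer-science notation."""
    return max(0.0, z)


def relu_piecewise(z: float) -> float:
    """ReLU written as the two-branch definition: 0 for z <= 0, z for z > 0."""
    return 0.0 if z <= 0.0 else z


def relu_positive_part(z: float) -> float:
    """ReLU written as the positive part z^+ = (z + |z|) / 2 (standard identity)."""
    return 0.5 * (z + abs(z))


# Hypothetical sample inputs on both sides of the kink at z = 0.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
values = [relu_max(z) for z in samples]
```
]]></code></p>
<p>On these samples the three forms return the same numbers, which is all that the chain of equalities in Eq. (<xref ref-type="disp-formula" rid="eqn-36">36</xref>) asserts.</p>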
<fig id="fig-24">
<label>Figure 24</label>
<caption><title><italic>Activation function</italic> (Section <xref ref-type="sec" rid="s4_4_2">4.4.2</xref>): Rectified linear function and its derivatives. See also Section <xref ref-type="sec" rid="s5_3_3">5.3.3</xref> and Figure <xref ref-type="fig" rid="fig-54">54</xref> for Parametric ReLU that helped surpass human-level performance in the ImageNet competition for the first time in 2015, Figure <xref ref-type="fig" rid="fig-3">3</xref> [<xref ref-type="bibr" rid="ref-61">61</xref>]. See also Figure <xref ref-type="fig" rid="fig-26">26</xref> for a halfwave rectifier.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-24.tif"/>
</fig>
<p>To transform an alternating current into a direct current, the first step is to rectify the alternating current by eliminating its negative parts; hence the meaning of the adjective &#x201C;rectified&#x201D; in <italic>rectified linear unit</italic> (ReLU). Figure <xref ref-type="fig" rid="fig-25">25</xref> shows the current-voltage relation for an ideal diode, for a resistance, which is in series with the diode, and for the resulting ReLU function that rectifies an alternating current as input into the halfwave rectifier circuit in Figure <xref ref-type="fig" rid="fig-26">26</xref>, resulting in a halfwave current as output.</p>
<fig id="fig-25">
<label>Figure 25</label>
<caption><title><italic>Current I versus voltage V</italic> (Section <xref ref-type="sec" rid="s4_4_2">4.4.2</xref>): Ideal diode, resistance, <italic>scaled</italic> rectified linear function as activation (transfer) function for the ideal diode and resistance in series. (Figure plotted with <italic>R</italic> = 2.) See also Figure <xref ref-type="fig" rid="fig-26">26</xref> for a halfwave rectifier.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-25.tif"/>
</fig>
<p>Mathematically, a periodic function remains periodic after passing through a (nonlinear) rectifier (activation function):</p>
<p><disp-formula id="eqn-37"><label>(37)</label><mml:math id="mml-eqn-37" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x27F9;</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-316"><mml:math id="mml-ieqn-316"><mml:mi>T</mml:mi></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-37">37</xref>) is the period of the input current <inline-formula id="ieqn-317"><mml:math id="mml-ieqn-317"><mml:mi>z</mml:mi></mml:math></inline-formula>.</p>
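<p>Eq. (<xref ref-type="disp-formula" rid="eqn-37">37</xref>) can be checked numerically with a short sketch. The sinusoidal input, the period T = 2, and the sampling points are our assumptions, chosen only for illustration; any periodic input would serve equally well.</p>
<p><code language="python"><![CDATA[
```python
import math


def relu(z: float) -> float:
    """Rectified linear function max(0, z)."""
    return max(0.0, z)


def z_input(x: float, period: float) -> float:
    """Hypothetical periodic input current: a sinusoid with the given period."""
    return math.sin(2.0 * math.pi * x / period)


def y_output(x: float, period: float) -> float:
    """Rectified output y(x) = a(z(x))."""
    return relu(z_input(x, period))


period = 2.0  # assumed value of T
xs = [0.1 * k for k in range(40)]

# Largest difference between y(x + T) and y(x) over the sampled points.
max_gap = max(abs(y_output(x + period, period) - y_output(x, period)) for x in xs)
```
]]></code></p>
<p>The quantity <monospace>max_gap</monospace> stays at the level of floating-point round-off, while the negative half of each cycle is mapped to zero, which is the halfwave behavior described above.</p>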
<p>Biological neurons encode and transmit information over long distances by generating (firing) electrical pulses called action potentials or spikes with a wide range of frequencies [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 1; see Figure <xref ref-type="fig" rid="fig-27">27</xref>. &#x201C;To reliably encode a wide range of signals, neurons need to achieve a broad range of firing frequencies and to move smoothly between low and high firing rates&#x201D; [<xref ref-type="bibr" rid="ref-114">114</xref>]. From the neuroscientific standpoint, the rectified linear function could be motivated as an idealization of the &#x201C;Type I&#x201D; relation between the firing rate (F) of a biological neuron and the input current (I), called the FI curve. Figure <xref ref-type="fig" rid="fig-27">27</xref> describes three types of FI curves, with Type I in the middle subfigure, where there is a continuous increase in the firing rate with increasing input current.</p>
<fig id="fig-26">
<label>Figure 26</label>
<caption><title><italic>Halfwave rectifier circuit</italic> (Section <xref ref-type="sec" rid="s4_4_2">4.4.2</xref>), with a primary alternating current <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>z</mml:mi></mml:math></inline-formula> going in as input (left), passing through a transformer to lower the voltage amplitude, with the secondary alternating current out of the transformer being put through a closed circuit with an ideal diode <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow></mml:math></inline-formula> and a resistor <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow></mml:math></inline-formula> in series, resulting in a halfwave output current, which can be grossly approximated by the <italic>scaled</italic> rectified linear function <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>y</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (right) as shown in Figure <xref ref-type="fig" rid="fig-25">25</xref>, with scaling factor <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow></mml:math></inline-formula>. The rectified linear unit in Figure <xref ref-type="fig" rid="fig-24">24</xref> corresponds to the case with <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. 
For a more accurate Shockley diode model, the relation between current <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mi>I</mml:mi></mml:math></inline-formula> and voltage <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mi>V</mml:mi></mml:math></inline-formula> for this circuit is given in Figure <xref ref-type="fig" rid="fig-29">29</xref>. Figure based on source in Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://commons.wikimedia.org/w/index.php?title=File:Halfwave.rectifier.en.svg&amp;oldid=145584715.tif">version 01:49, 7 January 2015</ext-link>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-26.tif"/>
</fig>
<p>The Shockley equation for a current <inline-formula id="ieqn-324"><mml:math id="mml-ieqn-324"><mml:mi>I</mml:mi></mml:math></inline-formula> going through a diode <inline-formula id="ieqn-325"><mml:math id="mml-ieqn-325"><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow></mml:math></inline-formula>, in terms of the voltage <inline-formula id="ieqn-326"><mml:math id="mml-ieqn-326"><mml:msub><mml:mi>V</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:math></inline-formula> across the diode, is given in mathematical form as:</p>
<p><disp-formula id="eqn-38"><label>(38)</label><mml:math id="mml-eqn-38" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>I</mml:mi><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>q</mml:mi><mml:msub><mml:mi>V</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow><mml:mo stretchy="false">&#x27F9;</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:msub><mml:mi>V</mml:mi><mml:mi>D</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>q</mml:mi></mml:mfrac><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mi>I</mml:mi><mml:mi>p</mml:mi></mml:mfrac><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mtext>.</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
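<p>The two halves of Eq. (<xref ref-type="disp-formula" rid="eqn-38">38</xref>) are inverses of each other, which can be confirmed with a short round-trip sketch. The function names are ours; the parameter values <italic>p</italic> = 0.5 and <italic>q</italic> = &#x2212;1.2 are borrowed from the caption of Figure <xref ref-type="fig" rid="fig-29">29</xref> purely as an illustration, and the sample voltages are hypothetical.</p>
<p><code language="python"><![CDATA[
```python
import math


def diode_current(v_d: float, p: float, q: float) -> float:
    """Left part of the Shockley relation: I = p * (exp(q * V_D) - 1)."""
    return p * (math.exp(q * v_d) - 1.0)


def diode_voltage(i: float, p: float, q: float) -> float:
    """Right part: V_D = log(I / p + 1) / q, valid only for I / p > -1."""
    return math.log(i / p + 1.0) / q


P, Q = 0.5, -1.2  # illustrative values taken from the caption of Figure 29

# Hypothetical diode voltages, mapped to a current and back again.
voltages = [-2.0, -1.0, 0.0, 1.0, 2.0]
round_trip = [diode_voltage(diode_current(v, P, Q), P, Q) for v in voltages]

# When q * V_D is large and negative, exp(q * V_D) vanishes and I tends to -p.
saturation_current = diode_current(10.0, P, Q)
```
]]></code></p>
<p>The list <monospace>round_trip</monospace> reproduces <monospace>voltages</monospace> up to round-off, and <monospace>saturation_current</monospace> is close to &#x2212;<italic>p</italic>, the small bounded current that the exponential model allows in the blocking direction.</p>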
<p>With the voltage across the resistance being <inline-formula id="ieqn-327"><mml:math id="mml-ieqn-327"><mml:msub><mml:mi>V</mml:mi><mml:mi>R</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mi>I</mml:mi></mml:math></inline-formula>, the voltage across the diode and the resistance in series is then</p>
<p><disp-formula id="eqn-39"><label>(39)</label><mml:math id="mml-eqn-39" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mi>D</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mi>R</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mn>1</mml:mn><mml:mi>q</mml:mi></mml:mfrac><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mi>I</mml:mi><mml:mi>p</mml:mi></mml:mfrac><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>R</mml:mi><mml:mi>I</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is plotted in Figure <xref ref-type="fig" rid="fig-29">29</xref>. The rectified linear function could be seen from Figure <xref ref-type="fig" rid="fig-29">29</xref> as a very rough approximation of the current-voltage relation in a halfwave rectifier circuit in Figure <xref ref-type="fig" rid="fig-26">26</xref>, in which a diode and a resistance are in series. In the Shockley model, the diode is leaky in the sense that there is a small amount of current flow when the polarity is reversed, unlike the case of an ideal diode or ReLU (Figure <xref ref-type="fig" rid="fig-24">24</xref>), and is better modeled by the Leaky ReLU activation function, in which there is a small positive (instead of just flat zero) slope for negative <inline-formula id="ieqn-340"><mml:math id="mml-ieqn-340"><mml:mi>z</mml:mi></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-40"><label>(40)</label><mml:math id="mml-eqn-40" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0.01</mml:mn><mml:mi>z</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mn>0.01</mml:mn><mml:mspace width="thinmathspace" /><mml:mi>z</mml:mi></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>z</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>z</mml:mi></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mn>0</mml:mn><mml:mo>&lt;</mml:mo><mml:mi>z</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
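<p>A minimal sketch of Eq. (<xref ref-type="disp-formula" rid="eqn-40">40</xref>) follows, with the max form and the two-branch form written side by side. The function names, the keyword argument <monospace>slope</monospace>, and the sample inputs are ours; the two forms agree only when the slope lies between 0 and 1, which holds for the value 0.01 used in Eq. (<xref ref-type="disp-formula" rid="eqn-40">40</xref>).</p>
<p><code language="python"><![CDATA[
```python
def leaky_relu(z: float, slope: float = 0.01) -> float:
    """Leaky ReLU in max form: max(slope * z, z), assuming 0 <= slope <= 1."""
    return max(slope * z, z)


def leaky_relu_piecewise(z: float, slope: float = 0.01) -> float:
    """Leaky ReLU in two-branch form: slope * z for z <= 0, z for z > 0."""
    return slope * z if z <= 0.0 else z


def relu(z: float) -> float:
    """Plain ReLU, for comparison on negative inputs."""
    return max(0.0, z)


# Hypothetical sample inputs on both sides of z = 0.
samples = [-200.0, -2.0, 0.0, 2.0, 200.0]
leaky_values = [leaky_relu(z) for z in samples]
relu_values = [relu(z) for z in samples]
```
]]></code></p>
<p>For negative inputs <monospace>leaky_values</monospace> contains small negative numbers where <monospace>relu_values</monospace> contains exact zeros; for positive inputs the two lists coincide.</p>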
<p>Prior to the introduction of ReLU, which had long been widely used in neuroscience as an activation function before 2011,<xref ref-type="fn" rid="fn71"><sup>71</sup></xref><fn id="fn71"><label>71</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 14, where ReLU was called the &#x201C;half-wave rectification operation&#x201D;, the meaning of which is explained above in Figure <xref ref-type="fig" rid="fig-26">26</xref>. The logistic sigmoid function (Figure <xref ref-type="fig" rid="fig-30">30</xref>) has also been used in neuroscience since the 1950s.</p></fn> the state of the art for deep-learning activation functions was the hyperbolic tangent (Figure <xref ref-type="fig" rid="fig-31">31</xref>), which performed better than the widely used, and much older, sigmoid function<xref ref-type="fn" rid="fn72"><sup>72</sup></xref><fn id="fn72"><label>72</label><p>See Section <xref ref-type="sec" rid="s13_3_1">13.3.1</xref> for a history of the sigmoid function, which dated back at least to 1974 in neuroscience.</p></fn> (Figure <xref ref-type="fig" rid="fig-30">30</xref>); see [<xref ref-type="bibr" rid="ref-113">113</xref>], in which it was reported that</p>
<disp-quote>
<fig id="fig-27">
<label>Figure 27</label>
<caption><title><italic>FI curves</italic> (Sections <xref ref-type="sec" rid="s4_4_2">4.4.2</xref>, <xref ref-type="sec" rid="s13_2_2">13.2.2</xref>). Firing rate frequency (F) versus applied depolarizing current (I), thus FI curves. Three types of FI curves. The time histories of voltage <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mi>V</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:math></inline-formula> provide a visualization of the spikes, current threshold, and spike firing rates. The applied (input) current <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>p</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is increased gradually until it passes a current threshold, then the neuron begins to fire. Two input current levels (two black dots on FI curves at the bottom) near the current threshold are shown, with one just below the threshold (black-line time history for <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>p</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) and one just above the threshold (blue line). Two corresponding histories of voltage <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>V</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:math></inline-formula> (flat black line, and blue line with spikes) are also shown. Type I displays a continuous increase in firing frequency from zero to higher values when the current continues to increase past the current threshold. Type II displays a discontinuity in firing frequency, with a sudden jump from zero to a finite frequency, when the current passes the threshold. 
At low concentration <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msub><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mi>A</mml:mi></mml:msub></mml:math></inline-formula> of potassium, the neuron exhibits a Type-II FI curve, then transitions to a Type-I FI curve as <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msub><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mi>A</mml:mi></mml:msub></mml:math></inline-formula> is increased, and returns to Type-II<inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msup><mml:mi></mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula> for higher concentration <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msub><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mi>A</mml:mi></mml:msub></mml:math></inline-formula>. See [<xref ref-type="bibr" rid="ref-114">114</xref>]. The <italic>scaled</italic> rectified linear unit (scaled ReLU, Figure <xref ref-type="fig" rid="fig-25">25</xref> and Figure <xref ref-type="fig" rid="fig-26">26</xref>) can be viewed as approximating the Type-I FI curve; see also Figure <xref ref-type="fig" rid="fig-28">28</xref> and Eq. (<xref ref-type="disp-formula" rid="eqn-505">505</xref>), where the FI curve is used in biological neuron firing-rate models. <ext-link ext-link-type="uri" xlink:href="https://www.pnas.org/page/about/rights-permissions.tif">Permission of NAS</ext-link>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-27.tif"/>
</fig>
<p>&#x201C;While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multilayer neural networks. Rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero.&#x201D;</p>
</disp-quote>
<p>The hard non-linearity of ReLU is localized at zero, but otherwise ReLU is a very simple function&#x2014;identity map for positive argument, zero for negative argument&#x2014;making it highly efficient for computation.</p>
<p>Also, due to errors in numerical computation, it is rare to hit exactly zero, where there is a hard non-linearity in ReLU:</p>
<disp-quote><p>&#x201C;In the case of <inline-formula id="ieqn-341"><mml:math id="mml-ieqn-341"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>z</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the left derivative at <inline-formula id="ieqn-342"><mml:math id="mml-ieqn-342"><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> is 0, and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate <inline-formula id="ieqn-345"><mml:math id="mml-ieqn-345"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value that was rounded to 0.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 186.</p>
</disp-quote>
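<p>The convention described in the quotation can be sketched as follows. Returning the left derivative (zero) at <italic>z</italic> = 0 is our assumption for this sketch, one of the two one-sided choices mentioned above; the function names and the step size of the finite-difference check are likewise ours.</p>
<p><code language="python"><![CDATA[
```python
from typing import Callable


def relu(z: float) -> float:
    """Rectified linear function max(0, z)."""
    return max(0.0, z)


def relu_grad(z: float) -> float:
    """Step (Heaviside) derivative of ReLU; returns the left derivative 0 at z = 0."""
    return 1.0 if z > 0.0 else 0.0


def central_difference(f: Callable[[float], float], z: float, h: float = 1e-6) -> float:
    """Symmetric finite-difference estimate of the derivative of f at z."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


# Away from the kink, the analytical derivative and the estimate agree.
check_points = [-1.5, 1.5]
estimates = [central_difference(relu, z) for z in check_points]
analytical = [relu_grad(z) for z in check_points]

# At the kink, the symmetric estimate averages the two one-sided derivatives.
estimate_at_zero = central_difference(relu, 0.0)
```
]]></code></p>
<p>Away from zero the two lists agree; at zero the symmetric estimate lies halfway between the one-sided derivatives 0 and 1, which illustrates why an implementation has to pick a convention there.</p>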
<fig id="fig-28">
<label>Figure 28</label>
<caption><title><italic>FI or FV curves</italic> (Sections <xref ref-type="sec" rid="s3">3</xref>, <xref ref-type="sec" rid="s4_4_2">4.4.2</xref>, <xref ref-type="sec" rid="s13_2_2">13.2.2</xref>). Neuron firing rate (F) versus input current (I) (FI curves, a,b,c) or voltage (V). The Integrate-and-Fire model in SubFigure (c) can be used to replace the sigmoid function to fit the experimental data points in SubFigure (a). The ReLU function in Figure <xref ref-type="fig" rid="fig-24">24</xref> can be used to approximate the region of the FI curve just beyond the current or voltage thresholds, as indicated in the red rectangle in SubFigure (c). Despite the advanced mathematics employed to produce the Type-II FI curve of a large number of Type-I neurons, as shown in SubFigure (b), it is not clear whether a similar result would be obtained if the single neuron displays a behavior as in Figure <xref ref-type="fig" rid="fig-27">27</xref>, with transition from Type II to Type I to Type II&#x002A; in a single neuron. See Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref> on &#x201C;Dynamic, time dependence, Volterra series&#x201D; for more discussion on Wilson&#x2019;s equations Eqs. (<xref ref-type="disp-formula" rid="eqn-508">508</xref>)-(<xref ref-type="disp-formula" rid="eqn-509">509</xref>) [<xref ref-type="bibr" rid="ref-118">118</xref>]. On the other hand for deep-learning networks, the above results are more than sufficient to motivate the use of ReLU, which has deep roots in neuroscience.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-28.tif"/>
</fig>
<fig id="fig-29">
<label>Figure 29</label>
<caption><title><italic>Halfwave rectifier</italic> (Sections <xref ref-type="sec" rid="s4_4_2">4.4.2</xref>, <xref ref-type="sec" rid="s5_3_2">5.3.2</xref>). Current <italic>I</italic> versus voltage <italic>V</italic> [red line in SubFigure (b)] in the halfwave rectifier circuit of Figure <xref ref-type="fig" rid="fig-26">26</xref>, for which the ReLU function in Figure <xref ref-type="fig" rid="fig-24">24</xref> is a gross approximation. SubFigure (a) was plotted with <italic>p</italic> = 0.5, <italic>q</italic> &#x003D; &#x2212;1.2, <italic>R</italic> = 1. See also Figure <xref ref-type="fig" rid="fig-138">138</xref> for the synaptic response of crayfish similar to the red line in SubFigure (b).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-29.tif"/>
</fig>
<p>Thus, in addition to the ability to train deep networks, another advantage of using ReLU is the high efficiency in computing both the layer outputs and the gradients for use in optimizing the parameters (weights and biases) to lower cost or loss, i.e., training; see Section <xref ref-type="sec" rid="s6">6</xref> on Training, and in particular Section <xref ref-type="sec" rid="s6_3">6.3</xref> on Stochastic Gradient Descent.</p>
<p>The activation function ReLU approximates more closely how biological neurons work than other activation functions (e.g., logistic sigmoid, tanh, etc.), as was established through experiments some sixty years ago, and had been used in neuroscience long (at least ten years) before being adopted in deep learning in 2011. Its use in deep learning is a clear influence from neuroscience; see Section <xref ref-type="sec" rid="s13_3">13.3</xref> on the history of activation functions, and Section <xref ref-type="sec" rid="s13_3_2">13.3.2</xref> on the history of the rectified linear function.</p>
<p>Deep-learning networks using ReLU mimic biological neural networks in the brain through a trade-off between two competing properties [<xref ref-type="bibr" rid="ref-113">113</xref>]:</p>
<list list-type="simple">
<list-item><label>(1)</label><p><italic>Sparsity</italic>. Only 1% to 4% of brain neurons are active at any one point in time. Sparsity saves brain energy. In deep networks, &#x201C;rectifying non-linearity gives rise to real zeros of activations and thus truly sparse representations.&#x201D; Sparsity provides representation robustness in that the non-zero features<xref ref-type="fn" rid="fn73"><sup>73</sup></xref><fn id="fn73"><label>73</label><p>See the definition of image &#x201C;predicate&#x201D; or image &#x201C;feature&#x201D; in Section <xref ref-type="sec" rid="s13_2_1">13.2.1</xref>, and in particular Footnote <xref ref-type="fn" rid="fn302">302</xref>.</p></fn> would have small changes for small changes of the data.</p></list-item>
<list-item><label>(2)</label><p><italic>Distributivity</italic>. Each feature of the data is represented distributively by many inputs, and each input is involved in distributively representing many features. Distributed representation is a key concept dating back to the revival of connectionism with [<xref ref-type="bibr" rid="ref-119">119</xref>] [<xref ref-type="bibr" rid="ref-120">120</xref>] and others; see Section <xref ref-type="sec" rid="s13_2_1">13.2.1</xref>.</p></list-item></list>
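<p>The &#x201C;real zeros&#x201D; mentioned under sparsity can be illustrated with a small experiment. The zero-mean Gaussian pre-activations, the sample size, and the seed are our assumptions; the resulting fraction of zeros says nothing about the 1% to 4% figure quoted for the brain, and only contrasts ReLU with the hyperbolic tangent on the same inputs.</p>
<p><code language="python"><![CDATA[
```python
import math
import random
from typing import List


def relu(z: float) -> float:
    """Rectified linear function max(0, z)."""
    return max(0.0, z)


def zero_fraction(values: List[float]) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


rng = random.Random(0)  # fixed seed so that the run is repeatable

# Hypothetical pre-activations z drawn from a standard normal distribution.
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10000)]

relu_sparsity = zero_fraction([relu(z) for z in pre_activations])
tanh_sparsity = zero_fraction([math.tanh(z) for z in pre_activations])
```
]]></code></p>
<p>Under these assumptions roughly half of the ReLU outputs are exactly zero, whereas the hyperbolic tangent produces small but nonzero outputs for every negative input.</p>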
<fig id="fig-30">
<label>Figure 30</label>
<caption><title><italic>Logistic sigmoid function</italic> (Sections <xref ref-type="sec" rid="s4_4_2">4.4.2</xref>, <xref ref-type="sec" rid="s5_1_3">5.1.3</xref>, <xref ref-type="sec" rid="s5_3_1">5.3.1</xref>, <xref ref-type="sec" rid="s13_3_3">13.3.3</xref>): <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:math></inline-formula> (red), with the tangent at the origin <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> (blue). See also Remark <xref ref-type="statement" rid="st5_3">5.3</xref> and Figure <xref ref-type="fig" rid="fig-46">46</xref> on the softmax function. </title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-30.tif"/>
</fig><fig id="fig-31">
<label>Figure 31</label>
<caption><title> <italic>Hyperbolic tangent function</italic> (Section <xref ref-type="sec" rid="s4_4_2">4.4.2</xref>): <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> (red) and its tangent <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>z</mml:mi></mml:math></inline-formula> at the coordinate origin (blue), showing that this activation function is <italic>identity</italic> for small signals.
</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-31.tif"/>
</fig>
</sec>
<sec id="s4_4_3"><label>4.4.3</label>
<title>Graphical representation, block diagrams</title>
<p>The block diagram for a one-layer network is given in Figure <xref ref-type="fig" rid="fig-32">32</xref>, with more details in terms of the number of inputs and of outputs given in Figure <xref ref-type="fig" rid="fig-33">33</xref>.</p>
<fig id="fig-32">
<label>Figure 32</label>
<caption><title><italic>One-layer network</italic> (Section <xref ref-type="sec" rid="s4_4_3">4.4.3</xref>) representing the relation between the predicted output <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> and the input <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with the weighted sum <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo>:=</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow></mml:math></inline-formula>; see Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>) and Eq. 
(<xref ref-type="disp-formula" rid="eqn-35">35</xref>) with <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. For lower-level details of this one layer, see Figure <xref ref-type="fig" rid="fig-33">33</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-32.tif"/>
</fig>
<p>For a multilayer neural network with <inline-formula id="ieqn-347"><mml:math id="mml-ieqn-347"><mml:mi>L</mml:mi></mml:math></inline-formula> layers, with input-output relation shown in Figure <xref ref-type="fig" rid="fig-34">34</xref>, the detailed components are given in Figure <xref ref-type="fig" rid="fig-35">35</xref>, which generalizes Figure <xref ref-type="fig" rid="fig-33">33</xref> to layer <inline-formula id="ieqn-348"><mml:math id="mml-ieqn-348"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
<fig id="fig-33">
<label>Figure 33</label>
<caption><title><italic>One-layer network</italic> (Section <xref ref-type="sec" rid="s4_4_3">4.4.3</xref>) in Figure <xref ref-type="fig" rid="fig-32">32</xref>: Lower level details, with <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mi>m</mml:mi></mml:math></inline-formula> processing units (rows or neurons), inputs <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> and predicted outputs <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>m</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-33.tif"/>
</fig>
<fig id="fig-34">
<label>Figure 34</label>
<caption><title><italic>Input-to-output mapping</italic> (Sections <xref ref-type="sec" rid="s4_4_3">4.4.3</xref>, <xref ref-type="sec" rid="s4_4_4">4.4.4</xref>): Layer <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in network with <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi>L</mml:mi></mml:math></inline-formula> layers in Figure <xref ref-type="fig" rid="fig-23">23</xref>, input-to-output mapping <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> for layer <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-34.tif"/>
</fig>
<fig id="fig-35">
<label>Figure 35</label>
<caption><title><italic>Low-level details of layer</italic> <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (Sections <xref ref-type="sec" rid="s4_4_3">4.4.3</xref>, <xref ref-type="sec" rid="s4_4_4">4.4.4</xref>) of the multilayer neural network in Figure <xref ref-type="fig" rid="fig-23">23</xref>, with <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:msub><mml:mi>m</mml:mi><mml:mi>&#x2113;</mml:mi></mml:msub></mml:math></inline-formula> as the number of processing units (rows or neurons), and thus the width of this layer, representing the layer processing (input-to-output) function <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-34">34</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-35.tif"/>
</fig>
</sec>
<sec id="s4_4_4"><label>4.4.4</label>
<title>Artificial neuron</title>
<p>Finally, we complete our <italic>top-down</italic> descent from the big picture of the overall multilayer neural network with <inline-formula id="ieqn-349"><mml:math id="mml-ieqn-349"><mml:mi>L</mml:mi></mml:math></inline-formula> layers in Figure <xref ref-type="fig" rid="fig-23">23</xref>, through Figure <xref ref-type="fig" rid="fig-34">34</xref> for a typical layer <inline-formula id="ieqn-350"><mml:math id="mml-ieqn-350"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and Figure <xref ref-type="fig" rid="fig-35">35</xref> for the lower-level details of layer <inline-formula id="ieqn-351"><mml:math id="mml-ieqn-351"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, down to the most basic level: a neuron in Figure <xref ref-type="fig" rid="fig-36">36</xref>, which is one row of layer <inline-formula id="ieqn-352"><mml:math id="mml-ieqn-352"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-35">35</xref>.</p>
<fig id="fig-36">
<label>Figure 36</label>
<caption><title><italic>Artificial neuron</italic> (Sections <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>, <xref ref-type="sec" rid="s4_4_4">4.4.4</xref>, <xref ref-type="sec" rid="s13_1">13.1</xref>), row <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mi>i</mml:mi></mml:math></inline-formula> in layer <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-35">35</xref>, representing the multiple-inputs-to-single-output relation <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> with <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:mo
stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:msub><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. This block diagram is the exact equivalent of Figure <xref ref-type="fig" rid="fig-8">8</xref>, Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>, and in [<xref ref-type="bibr" rid="ref-38">38</xref>]. See Figure <xref ref-type="fig" rid="fig-131">131</xref> for the corresponding biological neuron in Section <xref ref-type="sec" rid="s13_1">13.1</xref> on &#x201C;Early inspiration from biological neurons&#x201D;.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-36.tif"/>
</fig>
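<p>The relation between a layer and its neurons can be sketched numerically as follows. This is only a minimal illustration, assuming NumPy and a hyperbolic-tangent activation; the function names <monospace>layer</monospace> and <monospace>neuron</monospace>, the sizes, and the random values are hypothetical and chosen for this example. It evaluates the layer output as the activation of the weighted sum, and checks that row <italic>i</italic> of the layer reproduces the single-neuron output of Figure <xref ref-type="fig" rid="fig-36">36</xref>.</p>

```python
import numpy as np

def layer(x, W, b, a=np.tanh):
    """One layer: weighted sum z = W x + b, followed by the activation a(z)."""
    z = W @ x + b
    return a(z)

def neuron(x, w_i, b_i, a=np.tanh):
    """One neuron (row i of the layer): a(sum_j w_ij x_j + b_i)."""
    return a(np.dot(w_i, x) + b_i)

# Hypothetical sizes and values, chosen only for illustration.
n, m = 3, 2                      # n inputs, m neurons (rows)
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))  # weight matrix, one row per neuron
b = rng.standard_normal(m)       # one bias per neuron
x = rng.standard_normal(n)       # input vector

y_layer = layer(x, W, b)
y_rows = np.array([neuron(x, W[i], b[i]) for i in range(m)])
print(np.allclose(y_layer, y_rows))
```

<p>The activation function is passed as an argument so that the same sketch covers a linear unit (identity activation) as well as nonlinear choices.</p>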
</sec></sec>
<sec id="s4_5"><label>4.5</label>
<title>Representing XOR function with two-layer network</title>
<p>The XOR (exclusive-or) function played an important role in bringing down the first wave of AI, known as the cybernetics wave ([<xref ref-type="bibr" rid="ref-78">78</xref>], p. 14) since it was shown in [<xref ref-type="bibr" rid="ref-121">121</xref>] that Rosenblatt&#x2019;s perceptron (1958 [<xref ref-type="bibr" rid="ref-119">119</xref>], 1962 [<xref ref-type="bibr" rid="ref-120">120</xref>] [<xref ref-type="bibr" rid="ref-2">2</xref>]) could not represent the XOR function, defined in Table <xref ref-type="table" rid="table-2">2</xref>:</p>
<table-wrap id="table-2"><label>Table 2</label>
<caption><p><italic>Exclusive-or (XOR) function</italic> (Section <xref ref-type="sec" rid="s4_5">4.5</xref>) produces the <italic>True</italic> value only if its two arguments are different. The symbol <inline-formula id="ieqn-353"><mml:math id="mml-ieqn-353"><mml:mo>&#x2295;</mml:mo></mml:math></inline-formula> (&#x201C;Oh-plus&#x201D;) denotes the XOR operator. A concrete example of the XOR function: exactly one of two poker players is the winner, with no tie possible, i.e., both players cannot win, and both cannot lose.</p></caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr><th align="center"><inline-formula id="ieqn-354"><mml:math id="mml-ieqn-354"><mml:mi>j</mml:mi></mml:math></inline-formula></th>
<th align="center"><inline-formula id="ieqn-355"><mml:math id="mml-ieqn-355"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula></th>
<th align="center"><inline-formula id="ieqn-356"><mml:math id="mml-ieqn-356"><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> (XOR)</th></tr>
</thead>
<tbody>
<tr>
<td align="center">1</td>
<td align="center"><inline-formula id="ieqn-357"><mml:math id="mml-ieqn-357"><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula></td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center"><inline-formula id="ieqn-358"><mml:math id="mml-ieqn-358"><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula></td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center"><inline-formula id="ieqn-359"><mml:math id="mml-ieqn-359"><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula></td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center"><inline-formula id="ieqn-360"><mml:math id="mml-ieqn-360"><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula></td>
<td align="center">0</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The dataset or design matrix<xref ref-type="fn" rid="fn74"><sup>74</sup></xref><fn id="fn74"><label>74</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 103.</p></fn> <inline-formula id="ieqn-361"><mml:math id="mml-ieqn-361"><mml:mi mathvariant="bold-italic">X</mml:mi></mml:math></inline-formula> is the collection of the coordinates of all four points in Table <xref ref-type="table" rid="table-2">2</xref>:</p>
<p><disp-formula id="eqn-41"><label>(41)</label><mml:math id="mml-eqn-41" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="center center center center" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>An approximation (or prediction) for the XOR function <inline-formula id="ieqn-362"><mml:math id="mml-ieqn-362"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> with parameters <inline-formula id="ieqn-363"><mml:math id="mml-ieqn-363"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula> is denoted by <inline-formula id="ieqn-364"><mml:math id="mml-ieqn-364"><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mover accent='true'><mml:mi>f</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula>, with the mean squared error (MSE) given by:</p>
<p><disp-formula id="eqn-42"><label>(42)</label><mml:math id="mml-eqn-42" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>4</mml:mn></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>4</mml:mn></mml:munderover><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mover accent='true'><mml:mi>f</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>4</mml:mn></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>4</mml:mn></mml:munderover><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
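<p>As a concrete illustration of the design matrix in Eq. (<xref ref-type="disp-formula" rid="eqn-41">41</xref>) and the MSE in Eq. (<xref ref-type="disp-formula" rid="eqn-42">42</xref>), the following minimal sketch, assuming NumPy, builds the four XOR points column by column and evaluates the cost for a given vector of predictions. The helper name <monospace>mse</monospace> and the constant predictor used to exercise it are hypothetical and serve only as an example.</p>

```python
import numpy as np

# Design matrix of Eq. (41): one column per point of Table 2.
X = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# XOR targets, in the same column order as X.
y = np.logical_xor(X[0], X[1]).astype(float)

def mse(y_pred, y_true):
    """Mean squared error of Eq. (42), averaged over the four points."""
    return np.mean((y_pred - y_true) ** 2)

# Hypothetical predictor that always returns 1/2, used only as an example.
y_const = np.full(4, 0.5)

print(y)                # [0. 1. 1. 0.]
print(mse(y_const, y))  # 0.25
```

<p>Each of the four points contributes a squared error of 1/4 for this constant predictor, so the mean is also 1/4.</p>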
<p>We begin with a one-layer network to show that it cannot represent the XOR function,<xref ref-type="fn" rid="fn75"><sup>75</sup></xref><fn id="fn75"><label>75</label><p>This one-layer network is not the Rosenblatt perceptron in Figure <xref ref-type="fig" rid="fig-132">132</xref> due to the absence of the Heaviside function as activation function, and thus Section <xref ref-type="sec" rid="s4_5_1">4.5.1</xref> is not a proof that the Rosenblatt perceptron cannot represent the XOR function. For such a proof, see [<xref ref-type="bibr" rid="ref-121">121</xref>].</p></fn> then move on to a two-layer network, which can.</p>
<sec id="s4_5_1"><label>4.5.1</label>  
<title>One-layer network</title>
<p>Consider the following one-layer network,<xref ref-type="fn" rid="fn76"><sup>76</sup></xref><fn id="fn76"><label>76</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 167.</p></fn> in which the output <inline-formula id="ieqn-365"><mml:math id="mml-ieqn-365"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> is a linear combination of the coordinates <inline-formula id="ieqn-366"><mml:math id="mml-ieqn-366"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as inputs:</p>
<p><disp-formula id="eqn-43"><label>(43)</label><mml:math id="mml-eqn-43" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>+</mml:mo><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with the following matrices</p>
<p><disp-formula id="eqn-44"><label>(44)</label><mml:math id="mml-eqn-44" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>b</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi 
mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-45"><label>(45)</label><mml:math id="mml-eqn-45" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>We consider this linear model because of the following statement in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 14:</p>
<disp-quote><p>&#x201C;Models based on the <inline-formula id="ieqn-367"><mml:math id="mml-ieqn-367"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> used by the perceptron and ADALINE are called linear models. Linear models have many limitations... Most famously, they cannot learn the XOR function... Critics who observed these flaws in linear models caused a backlash against biologically inspired learning in general (Minsky and Papert, 1969). This was the first major dip in the popularity of neural networks.&#x201D;</p>
</disp-quote><p>First-time learners who have not seen the definition of Rosenblatt&#x2019;s (1958) perceptron [<xref ref-type="bibr" rid="ref-119">119</xref>] could mistake Eq. (<xref ref-type="disp-formula" rid="eqn-43">43</xref>) for the perceptron. Eq. (<xref ref-type="disp-formula" rid="eqn-43">43</xref>), however, is only a linear unit (a single neuron) without a (nonlinear) activation function, whereas the Rosenblatt perceptron was not a linear model and, more importantly, was a network with many neurons<xref ref-type="fn" rid="fn77"><sup>77</sup></xref><fn id="fn77"><label>77</label><p>See Section <xref ref-type="sec" rid="s13_2">13.2</xref> on the history of the linear combination (weighted sum) of inputs with biases.</p></fn>. A neuron in the Rosenblatt perceptron is given by Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>) in Section <xref ref-type="sec" rid="s13_2">13.2</xref>, with the Heaviside (nonlinear step) function as activation function; see Figure <xref ref-type="fig" rid="fig-132">132</xref>.</p>
<fig id="fig-37">
<label>Figure 37</label>
<caption><title><italic>Representing XOR function</italic> (Sections <xref ref-type="sec" rid="s4_5">4.5</xref>, <xref ref-type="sec" rid="s13_2">13.2</xref>). This one-layer network (which is not the Rosenblatt perceptron in Figure <xref ref-type="fig" rid="fig-132">132</xref>) cannot perform this task. For each input matrix <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in the design matrix <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> (see Table <xref ref-type="table" rid="table-2">2</xref>), the linear unit (neuron) <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:msub><mml:mi 
mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-43">43</xref>) predicts a value <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as output, which is collected in the output matrix <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>4</mml:mn></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. The MSE cost function <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. 
(<xref ref-type="disp-formula" rid="eqn-42">42</xref>) is used in gradient descent to find the parameters <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. The result is a constant function, <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula>, for <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula>, which cannot represent the XOR function.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-37.tif"/>
</fig>
<p>The MSE cost function in Eq. (<xref ref-type="disp-formula" rid="eqn-42">42</xref>) becomes</p>
<p><disp-formula id="eqn-46"><label>(46)</label><mml:math id="mml-eqn-46" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>4</mml:mn></mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>b</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>b</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Setting the gradient of the cost function in Eq. (<xref ref-type="disp-formula" rid="eqn-46">46</xref>) to zero and solving the resulting equations, we obtain the weights and the bias:</p>
<p><disp-formula id="eqn-47"><label>(47)</label><mml:math id="mml-eqn-47" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-48"><label>(48)</label><mml:math id="mml-eqn-48" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">&#x27F9;</mml:mo><mml:mn>2</mml:mn><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">&#x27F9;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x2202;</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">&#x27F9;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mstyle></mml:mtd></mml:mtr></mml:mtable><mml:mo>}</mml:mo></mml:mrow><mml:mo stretchy="false">&#x27F9;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>from which the predicted output <inline-formula id="ieqn-374"><mml:math id="mml-ieqn-374"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-43">43</xref>) is the same constant at every point of the dataset (or design matrix) <inline-formula id="ieqn-375"><mml:math id="mml-ieqn-375"><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-49"><label>(49)</label><mml:math id="mml-eqn-49" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>4</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and thus this one-layer network cannot represent the XOR function. Eqs. (<xref ref-type="disp-formula" rid="eqn-48">48</xref>) are called the &#x201C;normal&#x201D; equations.<xref ref-type="fn" rid="fn78"><sup>78</sup></xref><fn id="fn78"><label>78</label><p>In least-squares linear regression, the normal equations are often presented in matrix form, starting from the errors (or residuals) at the data points, gathered in the matrix <inline-formula id="ieqn-3064"><mml:math id="mml-ieqn-3064"><mml:mi mathvariant='bold-italic'>e</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula>. To minimize the sum of the squared errors <inline-formula id="ieqn-3065"><mml:math id="mml-ieqn-3065"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">e</mml:mi><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, consider a perturbation <inline-formula id="ieqn-3066"><mml:math id="mml-ieqn-3066"><mml:msub><mml:mi mathvariant='bold-italic'>&#x03B8;</mml:mi><mml:mi>&#x03F5;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B3;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-3067"><mml:math id="mml-ieqn-3067"><mml:msub><mml:mi mathvariant='bold-italic'>e</mml:mi><mml:mi>&#x03F5;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>&#x03F5;</mml:mi></mml:msub></mml:math></inline-formula>, then set the directional derivative of <inline-formula id="ieqn-3068"><mml:math id="mml-ieqn-3068"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">e</mml:mi><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula> to zero, i.e., <inline-formula id="ieqn-3069"><mml:math id="mml-ieqn-3069"><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mfrac><mml:mi>d</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>&#x03F5;</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x2225;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>e</mml:mi></mml:mstyle><mml:mi>&#x03F5;</mml:mi></mml:msub><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mi mathvariant="bold-italic">&#x03B3;</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>y</mml:mi></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:math></inline-formula> for every perturbation direction <inline-formula><mml:math><mml:mi mathvariant="bold-italic">&#x03B3;</mml:mi></mml:math></inline-formula>, which yields <inline-formula><mml:math><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn mathvariant="bold">0</mml:mn></mml:math></inline-formula>, the &#x201C;normal equation&#x201D; in matrix form, since the error matrix <inline-formula id="ieqn-3070"><mml:math id="mml-ieqn-3070"><mml:mi mathvariant='bold-italic'>e</mml:mi></mml:math></inline-formula> is required to be &#x201C;normal&#x201D; (orthogonal) to the span of the columns of <inline-formula id="ieqn-3071"><mml:math id="mml-ieqn-3071"><mml:mi mathvariant='bold-italic'>X</mml:mi></mml:math></inline-formula>. 
For the above XOR function with four data points, the relevant matrices are (using the Matlab / Octave notation) <inline-formula id="ieqn-3072"><mml:math id="mml-ieqn-3072"><mml:mi mathvariant='bold-italic'>e</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-3073"><mml:math id="mml-ieqn-3073"><mml:mi mathvariant='bold-italic'>y</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>, and <inline-formula id="ieqn-3074"><mml:math id="mml-ieqn-3074"><mml:mi mathvariant='bold-italic'>X</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, which also lead to Eq. 
(<xref ref-type="disp-formula" rid="eqn-48">48</xref>). See, e.g., [<xref ref-type="bibr" rid="ref-122">122</xref>] [<xref ref-type="bibr" rid="ref-123">123</xref>] and [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 106.</p></fn></p></sec>
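<p>The one-layer result of Section 4.5.1 can be checked numerically. The sketch below is a minimal illustration, assuming NumPy is available and using the row-per-sample design matrix of footnote 78; the variable names (<monospace>X</monospace>, <monospace>y</monospace>, <monospace>theta</monospace>, <monospace>y_pred</monospace>, <monospace>mse</monospace>) are chosen here for illustration and do not come from the cited references. It solves the normal equations in matrix form and evaluates the resulting predictions.</p>

```python
import numpy as np

# Design matrix with one row per data point: columns are [x1, x2, 1],
# so that theta = [w1, w2, b] (assumed ordering, as in footnote 78).
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR targets at the four points

# Normal equations in matrix form: (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
w1, w2, b = theta

# Predicted outputs of the one-layer (linear) network and its MSE cost
y_pred = X @ theta
mse = np.mean((y_pred - y) ** 2)

print("w1, w2, b =", w1, w2, b)
print("predictions =", y_pred)
print("MSE =", mse)
```

<p>Under these assumptions the solve step should return zero weights and a bias of one half, so that all four predictions coincide, in agreement with Eqs. (48) and (49).</p>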
<sec id="s4_5_2"><label>4.5.2</label>
<title>Two-layer network</title>
<p>The four points in Table <xref ref-type="table" rid="table-2">2</xref> are not linearly separable, i.e., no straight line separates them such that the two points with XOR value zero lie on one side of the line and the two points with XOR value one lie on the other side. A one-layer network cannot represent the XOR function, as shown above. Rosenblatt (1958) [<xref ref-type="bibr" rid="ref-119">119</xref>] wrote:</p>
<disp-quote><p>&#x201C;It has, in fact, been widely conceded by psychologists that there is little point in trying to &#x2018;disprove&#x2019; any of the major learning theories in use today, since by extension, or a change in parameters, they have all proved capable of adapting to any specific empirical data. In considering this approach, one is reminded of a remark attributed to Kistiakowsky, that <italic>&#x2018;given seven parameters, I could fit an elephant</italic>.&#x2019; &#x201D;</p>
</disp-quote><p>So we now add a second layer, and thus more parameters, in the hope of being able to represent the XOR function, as shown in Figure <xref ref-type="fig" rid="fig-38">38</xref>.<xref ref-type="fn" rid="fn79"><sup>79</sup></xref><fn id="fn79"><label>79</label><p>Our presentation is more detailed and more general than in [<xref ref-type="bibr" rid="ref-78">78</xref>], pp. 167&#x2013;171, where there was no intuitive explanation of how the numbers were obtained, and where only the ReLU activation function was used.</p></fn></p>
<fig id="fig-38">
<label>Figure 38</label>
<caption><title><italic>Representing XOR function</italic> (Section <xref ref-type="sec" rid="s4_5">4.5</xref>). This two-layer network can perform this task. The four points in the design matrix <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> (see Table <xref ref-type="table" rid="table-2">2</xref>) are converted by the two nonlinear units (neurons or rows) of Layer (1) into three points that are linearly separable, i.e., <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:msup><mml:mi mathvariant="bold-italic">Y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mn>4</mml:mn><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo 
stretchy='false'>]</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi mathvariant="double-struck">R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mrow><mml:mtext>(1)</mml:mtext></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mtext>(1)</mml:mtext></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mtext>(1)</mml:mtext></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">B</mml:mi><mml:mrow><mml:mtext>(1)</mml:mtext></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi mathvariant="double-struck">R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> , as in Eq. (<xref ref-type="disp-formula" rid="eqn-58">58</xref>), and <italic>a</italic>(&#x00B7;) a nonlinear activation function. 
Layer (2) consists of a single linear unit (neuron or row) with three parameters, i.e., <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover><mml:mn>4</mml:mn></mml:msub><mml:mo stretchy='false'>]</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mtext>(2)</mml:mtext></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mtext>(2)</mml:mtext></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mtext>(2)</mml:mtext></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>b</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mtext>(2)</mml:mtext></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi mathvariant="double-struck">R</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></inline-formula>. 
The three non-aligned points in <bold><italic>X</italic></bold><sup>(2)</sup> provide three equations to solve for the three parameters <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mtext>(2)</mml:mtext></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>b</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mtext>(2)</mml:mtext></mml:mrow></mml:msubsup><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi mathvariant="double-struck">R</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>; see Eq. (<xref ref-type="disp-formula" rid="eqn-61">61</xref>).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-38.tif"/>
</fig>
<p><bold>Layer (1):</bold> six parameters (4 weights, 2 biases), plus a (nonlinear) activation function. The purpose of the weights and biases is to change coordinates so that the four input points of the XOR function are mapped into three points: the two points with XOR value equal to 1 coalesce into a single point, and the three resulting points lie on a straight line. Since these three aligned points are still not linearly separable, the activation function then moves them out of alignment, which makes them linearly separable.</p>
<p><disp-formula id="eqn-50"><label>(50)</label><mml:math id="mml-eqn-50" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>4</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-51"><label>(51)</label><mml:math id="mml-eqn-51" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msubsup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" 
/><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>4</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-52"><label>(52)</label><mml:math id="mml-eqn-52" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="center center center center" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-53"><label>(53)</label><mml:math id="mml-eqn-53" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">B</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="right right right right" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-54"><label>(54)</label><mml:math id="mml-eqn-54" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">B</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>To map the two points <inline-formula id="ieqn-376"><mml:math id="mml-ieqn-376"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-377"><mml:math id="mml-ieqn-377"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>, at which the XOR value is 1, into a single point, the two rows of <inline-formula id="ieqn-378"><mml:math id="mml-ieqn-378"><mml:msup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are selected to be identically <inline-formula id="ieqn-379"><mml:math id="mml-ieqn-379"><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> as shown in Eq. (<xref ref-type="disp-formula" rid="eqn-51">51</xref>). The first term in Eq. (<xref ref-type="disp-formula" rid="eqn-54">54</xref>) yields three distinct points aligned along the bisector in the first quadrant (i.e., the line <inline-formula id="ieqn-380"><mml:math id="mml-ieqn-380"><mml:msub><mml:mi>z</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> in the z-plane), with all coordinates non-negative; see Figure <xref ref-type="fig" rid="fig-39">39</xref>:</p>
<p><disp-formula id="eqn-55"><label>(55)</label><mml:math id="mml-eqn-55" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="center center center center" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mtext>.</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-39">
<label>Figure 39</label>
<caption><title><italic>Two-layer network for XOR representation</italic> (Section <xref ref-type="sec" rid="s4_5">4.5</xref>). <italic>Left</italic>: XOR function, with <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:mi>A</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:mi>B</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mi>C</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>; see Eq. (<xref ref-type="disp-formula" rid="eqn-52">52</xref>). The XOR value for the solid red dots is 1, and for the open blue dots 0. <italic>Right</italic>: Images of points <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>D</mml:mi></mml:math></inline-formula> in the <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:mi>z</mml:mi></mml:math></inline-formula>-plane due <italic>only</italic> to the first term of Eq. (<xref ref-type="disp-formula" rid="eqn-54">54</xref>), i.e., <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:msup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, which is shown in Eq. (<xref ref-type="disp-formula" rid="eqn-55">55</xref>). See also Figure <xref ref-type="fig" rid="fig-40">40</xref>.
</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-39.tif"/>
</fig>
<p>For activation functions such as ReLU or Heaviside<xref ref-type="fn" rid="fn80"><sup>80</sup></xref><fn id="fn80"><label>80</label><p>In general, the Heaviside function is not used as an activation function, since its gradient is zero almost everywhere and it would thus not work with gradient descent. For this XOR problem, which is solved <italic>without</italic> gradient descent, the Heaviside function offers a workable solution, as does the rectified linear function.</p></fn> to have any effect, the above three points are next translated in the negative <inline-formula id="ieqn-381"><mml:math id="mml-ieqn-381"><mml:msub><mml:mi>z</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> direction using the biases in Eq. (<xref ref-type="disp-formula" rid="eqn-53">53</xref>), so that Eq. (<xref ref-type="disp-formula" rid="eqn-54">54</xref>) yields:</p>
<p><disp-formula id="eqn-56"><label>(56)</label><mml:math id="mml-eqn-56" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="center center center center" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and thus, with the ReLU activation function,</p>
<p><disp-formula id="eqn-57"><label>(57)</label><mml:math id="mml-eqn-57" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">Y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="center center center center" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>For a general activation function <inline-formula id="ieqn-382"><mml:math id="mml-ieqn-382"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the outputs of Layer (1) are:</p>
<p><disp-formula id="eqn-58"><label>(58)</label><mml:math id="mml-eqn-58" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">Y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="center center center center" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo 
stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
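<p>The Layer (1) computation in Eqs. (<xref ref-type="disp-formula" rid="eqn-54">54</xref>)-(<xref ref-type="disp-formula" rid="eqn-57">57</xref>) can be checked numerically with the following minimal NumPy sketch. The array names <monospace>X1</monospace>, <monospace>W1</monospace>, <monospace>b1</monospace> and the helper <monospace>layer1</monospace> are illustrative names introduced here, not part of the original text; the weight matrix is assumed to have both rows equal to [1, 1], as stated above, and ReLU is used as the activation.</p>
<preformat>
```python
import numpy as np

# Inputs X^(1) from Eq. (52): one column per point x_1, ..., x_4.
X1 = np.array([[0, 1, 0, 1],
               [0, 0, 1, 1]], dtype=float)

# Layer-1 weights: both rows equal to [1, 1], as stated in the text.
W1 = np.array([[1, 1],
               [1, 1]], dtype=float)

# Layer-1 bias b^(1) = [0, -1]^T; broadcasting over the four columns
# reproduces the matrix B^(1) of Eq. (53).
b1 = np.array([[0], [-1]], dtype=float)


def relu(z):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(z, 0.0)


def layer1(X, activation):
    """Return (W X, Z, Y): linear map, translated map, activated output."""
    WX = W1 @ X          # first term of Eq. (54)
    Z = WX + b1          # Eq. (54)
    Y = activation(Z)    # Eq. (58) for a general activation
    return WX, Z, Y


WX, Z1, Y1 = layer1(X1, relu)
print(WX)
print(Z1)
print(Y1)
```
</preformat>
<p>Under these assumptions, the three printed arrays should agree with the matrices in Eqs. (<xref ref-type="disp-formula" rid="eqn-55">55</xref>), (<xref ref-type="disp-formula" rid="eqn-56">56</xref>), and (<xref ref-type="disp-formula" rid="eqn-57">57</xref>), respectively; passing a different element-wise function to <monospace>layer1</monospace> gives the general form of Eq. (<xref ref-type="disp-formula" rid="eqn-58">58</xref>).</p>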
<p><bold>Layer (2):</bold> three parameters (2 weights, 1 bias), no activation function. Eq. (<xref ref-type="disp-formula" rid="eqn-59">59</xref>) for this layer is identical to Eq. (<xref ref-type="disp-formula" rid="eqn-43">43</xref>) for the one-layer network above, with the output <inline-formula id="ieqn-383"><mml:math id="mml-ieqn-383"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of Layer (1) as input <inline-formula id="ieqn-384"><mml:math id="mml-ieqn-384"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, as shown in Eq. (<xref ref-type="disp-formula" rid="eqn-57">57</xref>):</p>
<p><disp-formula id="eqn-59"><label>(59)</label><mml:math id="mml-eqn-59" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with three distinct points in Eq. (<xref ref-type="disp-formula" rid="eqn-57">57</xref>), because <inline-formula id="ieqn-385"><mml:math id="mml-ieqn-385"><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>, to solve for these three parameters:</p>
<p><disp-formula id="eqn-60"><label>(60)</label><mml:math id="mml-eqn-60" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>We have three equations:</p>
<p><disp-formula id="eqn-61"><label>(61)</label><mml:math id="mml-eqn-61" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mn>4</mml:mn></mml:msub></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>w</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>b</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>for which the exact analytical solution for the parameters <inline-formula id="ieqn-386"><mml:math id="mml-ieqn-386"><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is easy to obtain, but the expressions are rather lengthy. Hence, here we only give the numerical solution for <inline-formula id="ieqn-387"><mml:math id="mml-ieqn-387"><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in the case of the logistic sigmoid function in Table <xref ref-type="table" rid="table-3">3</xref>.</p>
<table-wrap id="table-3"><label>Table 3</label>
<caption>
<p><italic>Two-layer network for XOR representation</italic> (Section <xref ref-type="sec" rid="s4_5">4.5</xref>). Values of parameters <inline-formula id="ieqn-388"><mml:math id="mml-ieqn-388"><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-60">60</xref>). The results are exact for ReLU and Heaviside, but rounded for sigmoid due to the irrational Euler&#x2019;s number <inline-formula id="ieqn-389"><mml:math id="mml-ieqn-389"><mml:mi>e</mml:mi></mml:math></inline-formula>. See Figure <xref ref-type="fig" rid="fig-40">40</xref>.</p></caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th align="center">Activation function</th>
<th align="center">Parameters <inline-formula id="ieqn-390"><mml:math id="mml-ieqn-390"><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">ReLU</td>
<td align="center"><inline-formula id="ieqn-391"><mml:math id="mml-ieqn-391"><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mphantom><mml:mo>&#x2212;</mml:mo></mml:mphantom></mml:mrow><mml:mn>0</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td align="center">Heaviside</td>
<td align="center"><inline-formula id="ieqn-392"><mml:math id="mml-ieqn-392"><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mphantom><mml:mo>&#x2212;</mml:mo></mml:mphantom></mml:mrow><mml:mn>0</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td align="center">Sigmoid</td>
<td align="center"><inline-formula id="ieqn-393"><mml:math id="mml-ieqn-393"><mml:mo stretchy="false">[</mml:mo><mml:mn>24.5942</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>20.2663</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>6.8466</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
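<p>The entries of Table <xref ref-type="table" rid="table-3">3</xref> can be reproduced by solving the linear system in Eq. (<xref ref-type="disp-formula" rid="eqn-61">61</xref>) numerically. The sketch below is one way to do so with NumPy; the function names are illustrative, and the Heaviside function is assumed to take the value 0 at the origin, which is the convention that yields the tabulated Heaviside entry.</p>
<preformat>
```python
import numpy as np


def relu(z):
    """Rectified linear unit."""
    return np.maximum(z, 0.0)


def heaviside(z):
    """Heaviside step; assumption: H(0) = 0."""
    return np.heaviside(z, 0.0)


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def solve_theta2(a):
    """Solve the 3x3 system of Eq. (61) for theta^(2) = [w_1, w_2, b]."""
    A = np.array([[a(0.0), a(-1.0), 1.0],
                  [a(1.0), a(0.0), 1.0],
                  [a(2.0), a(1.0), 1.0]])
    y = np.array([0.0, 1.0, 0.0])  # XOR targets y_1, y_2, y_4
    return np.linalg.solve(A, y)


for name, a in [("ReLU", relu), ("Heaviside", heaviside), ("Sigmoid", sigmoid)]:
    print(name, np.round(solve_theta2(a), 4))
```
</preformat>
<p>Only three of the four XOR conditions enter the system because the second and third points coincide after Layer (1); the fourth condition is then satisfied automatically.</p>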
<fig id="fig-40">
<label>Figure 40</label>
<caption><title><italic>Two-layer network for XOR representation</italic> (Section <xref ref-type="sec" rid="s4_5">4.5</xref>). <italic>Left</italic>: Images of points <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>D</mml:mi></mml:math></inline-formula> of <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-56">56</xref>), obtained after a translation by adding the bias <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-51">51</xref>) to the same points <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>D</mml:mi></mml:math></inline-formula> in the right subfigure of Figure <xref ref-type="fig" rid="fig-39">39</xref>. The XOR value for the solid red dots is 1, and for the open blue dots 0.
<italic>Right</italic>: Images of points <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>D</mml:mi></mml:math></inline-formula> after applying the ReLU activation function, which moves point <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mi>A</mml:mi></mml:math></inline-formula> to the origin; see Eq. (<xref ref-type="disp-formula" rid="eqn-57">57</xref>). The points <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mi>A</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mi>B</mml:mi></mml:math></inline-formula> <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mo stretchy="false">(</mml:mo><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mi>D</mml:mi></mml:math></inline-formula> are no longer aligned, and are thus linearly separable by the green dotted line, whose normal vector has the components <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>, which are the weights shown in Table <xref ref-type="table" rid="table-3">3</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-40.tif"/>
</fig>
<p>We conjecture that any (nonlinear) function <inline-formula id="ieqn-429"><mml:math id="mml-ieqn-429"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in the zoo of activation functions listed, e.g., in &#x201C;Activation function&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Activation_function&amp;oldid=897708534">version 21:00, 18 May 2019</ext-link> or in [<xref ref-type="bibr" rid="ref-36">36</xref>] (see Figure <xref ref-type="fig" rid="fig-139">139</xref>), would move the three points in <inline-formula id="ieqn-430"><mml:math id="mml-ieqn-430"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-56">56</xref>) out of alignment, and thus provide the corresponding unique solution <inline-formula id="ieqn-431"><mml:math id="mml-ieqn-431"><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> for Eq. (<xref ref-type="disp-formula" rid="eqn-61">61</xref>).</p> 
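<p>The conjecture can be probed numerically for a few common activation functions. The sketch below is only an illustration, not a proof: it assumes that the rows of the pre-activation matrix are the points <italic>A</italic> = (0, -1), <italic>B</italic> = <italic>C</italic> = (1, 0), <italic>D</italic> = (2, 1), which is consistent with the bias and the coincidence of <italic>B</italic> and <italic>C</italic> stated in the caption of Figure <xref ref-type="fig" rid="fig-40">40</xref>, but is not copied from Eq. (<xref ref-type="disp-formula" rid="eqn-56">56</xref>). The names <italic>signed_area</italic> and <italic>activations</italic> are hypothetical helpers. A zero signed area of the triangle <italic>A</italic>, <italic>B</italic>, <italic>D</italic> means that the three points are aligned.</p>

```python
import numpy as np

# Assumed pre-activation values of the hidden layer for the four XOR inputs
# (an assumption consistent with the caption of Figure 40, not taken from Eq. (56)).
Z1 = np.array([[0.0, -1.0],   # A
               [1.0,  0.0],   # B
               [1.0,  0.0],   # C (coincides with B)
               [2.0,  1.0]])  # D


def signed_area(points):
    """Twice the signed area of the triangle A, B, D; zero means aligned."""
    a, b, d = points[0], points[1], points[3]
    u, v = b - a, d - a
    return u[0] * v[1] - u[1] * v[0]


# A small, hand-picked subset of the "zoo" of activation functions.
activations = {
    "identity": lambda z: z,
    "relu": lambda z: np.maximum(z, 0.0),
    "tanh": np.tanh,
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
}

areas = {name: float(signed_area(f(Z1))) for name, f in activations.items()}

# With ReLU, the output weights [1, -2] quoted in the caption reproduce XOR.
xor_output = np.maximum(Z1, 0.0) @ np.array([1.0, -2.0])

for name, area in areas.items():
    print(name, area)
print(xor_output)
```

<p>Under these assumptions, only the identity (no activation) leaves the three points aligned; each of the three nonlinear activation functions tested breaks the alignment, which is what the conjecture would predict for this particular configuration.</p>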
<statement id="st4_4"><title><xref ref-type="statement" rid="st4_4">Remark 4.4</xref>.</title>
<p><italic>Number of parameters</italic>. In 1953, physicist Freeman Dyson (Institute for Advanced Study, Princeton) consulted with Nobel Laureate Enrico Fermi about a new mathematical model for a difficult physics problem that Dyson and his students had just developed. Fermi asked Dyson how many parameters they had. &#x201C;Four&#x201D;, Dyson replied. Fermi then gave his now famous comment: &#x201C;I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk&#x201D; [<xref ref-type="bibr" rid="ref-124">124</xref>].</p>
<p>But it was only more than sixty years later that physicists were able to plot an elephant in 2-D using a model with four complex numbers as parameters [<xref ref-type="bibr" rid="ref-125">125</xref>].</p>
<p>With nine parameters, the elephant can be made to walk (representing the XOR function), and with a billion parameters, it may even perform some acrobatic maneuver in 3-D; see Section <xref ref-type="sec" rid="s4_6">4.6</xref> on depth of multilayer networks.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement></sec></sec>
<sec id="s4_6"><label>4.6</label>
<title>What is &#x201C;deep&#x201D; in &#x201C;deep networks&#x201D;? Size, architecture</title>
<sec id="s4_6_1"><label>4.6.1</label>
<title>Depth, size</title>
<p>The concept of network depth turns out to be more complex than initially thought. While for a <italic>fully-connected</italic> feedforward neural network (in which all outputs of a layer are connected to a neuron in the following layer), depth could be considered as the number of layers, there is in general no consensus on the accepted definition of depth. It was stated in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 8, that:<xref ref-type="fn" rid="fn81"><sup>81</sup></xref><fn id="fn81"><label>81</label><p>There are two viewpoints on the definition of depth, one based on the computational graph, and one based on the conceptual graph. From the computational-graph viewpoint, depth is the number of sequential instructions that must be executed in an architecture. From the conceptual-graph viewpoint, depth is the number of concept levels, going from simple concepts to more complex concepts. See also [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 163, for the depth of fully-connected feedforward networks as the &#x201C;length of the chain&#x201D; in Eq. (<xref ref-type="disp-formula" rid="eqn-18">18</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-23">23</xref>), which is the number of layers.</p></fn></p>
<disp-quote><p>&#x201C;There is no single correct value for the depth of an architecture,<xref ref-type="fn" rid="fn82"><sup>82</sup></xref><fn id="fn82"><label>82</label><p>There are several different network architectures. <italic>Convolutional neural networks</italic> (CNN) use sparse connections, have achieved great success in image recognition, and contributed to the burst of interest in deep learning since winning the ImageNet competition in 2012 by almost halving the image classification error rate; see [<xref ref-type="bibr" rid="ref-13">13</xref>, <xref ref-type="bibr" rid="ref-12">12</xref>, <xref ref-type="bibr" rid="ref-75">75</xref>]. <italic>Recurrent neural networks</italic> (RNN) are used to process a sequence of inputs to a system with changing states as in a dynamical system, to be discussed in Section <xref ref-type="sec" rid="s7">7</xref>. There are other networks with skip connections, in which information flows from layer <inline-formula id="ieqn-3075"><mml:math id="mml-ieqn-3075"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to layer <inline-formula id="ieqn-3076"><mml:math id="mml-ieqn-3076"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, skipping layer <inline-formula id="ieqn-3077"><mml:math id="mml-ieqn-3077"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>; see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 196.</p></fn> just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as &#x201C;deep.&#x201D; &#x201D;</p>
</disp-quote><p>For example, for the same number of layers, the &#x201C;depth&#x201D; of a <italic>sparsely-connected</italic> feedforward network (in which not all outputs of a layer are connected to a neuron in the following layer) should be smaller than the &#x201C;depth&#x201D; of a <italic>fully-connected</italic> feedforward network.</p>
<p>The lack of consensus on the boundary between &#x201C;shallow&#x201D; and &#x201C;deep&#x201D; networks is echoed in [<xref ref-type="bibr" rid="ref-12">12</xref>]:</p>
<disp-quote><p>&#x201C;At which problem depth does <italic>Shallow Learning</italic> end, and <italic>Deep Learning</italic> begin? Discussions with DL experts have not yet yielded a conclusive response to this question. Instead of committing myself to a precise answer, let me just define for the purposes of this overview: problems of depth <inline-formula id="ieqn-432"><mml:math id="mml-ieqn-432"><mml:mo>&gt;</mml:mo></mml:math></inline-formula> 10 require <italic>Very Deep Learning</italic>.&#x201D;</p>
</disp-quote> 
<statement id="st4_5"><title><xref ref-type="statement" rid="st4_5">Remark 4.5</xref>.</title>
<p><italic>Action depth, state depth</italic>. In view of Remark <xref ref-type="statement" rid="st4_3">4.3</xref>, which type of layer (action or state) were they talking about in the above quotation? We define here <italic>action depth</italic> as the number of action layers, and <italic>state depth</italic> as the number of state layers. The abstract network in Figure <xref ref-type="fig" rid="fig-23">23</xref> has action depth <inline-formula id="ieqn-433"><mml:math id="mml-ieqn-433"><mml:mi>L</mml:mi></mml:math></inline-formula> and state depth <inline-formula id="ieqn-434"><mml:math id="mml-ieqn-434"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-435"><mml:math id="mml-ieqn-435"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as the number of hidden (state) layers.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>The review paper [<xref ref-type="bibr" rid="ref-13">13</xref>] was credited in [<xref ref-type="bibr" rid="ref-38">38</xref>] with stating that &#x201C;training neural networks with more than three hidden layers is called deep learning&#x201D;, implying that a network is considered &#x201C;deep&#x201D; if its number of hidden (state) layers <inline-formula id="ieqn-436"><mml:math id="mml-ieqn-436"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&gt;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula>. In the work reported in [<xref ref-type="bibr" rid="ref-38">38</xref>], the authors used networks with the number of hidden (state) layers <inline-formula id="ieqn-437"><mml:math id="mml-ieqn-437"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> varying from one to five, and with a constant hidden (state) layer width of <inline-formula id="ieqn-438"><mml:math id="mml-ieqn-438"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>50</mml:mn></mml:math></inline-formula>, for all hidden (state) layers <inline-formula id="ieqn-439"><mml:math id="mml-ieqn-439"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula>; see Table <xref ref-type="table" rid="table-1">1</xref> in [<xref ref-type="bibr" rid="ref-38">38</xref>], reproduced in Figure <xref ref-type="fig" rid="fig-99">99</xref> in Section <xref ref-type="sec" rid="s10_2_2">10.2.2</xref>.</p>
<p>An example of recognizing multidigit numbers in photographs of addresses, in which the test accuracy increased (or test error decreased) with increasing depth, is provided in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 196; see Figure <xref ref-type="fig" rid="fig-41">41</xref>.</p>
<fig id="fig-41">
<label>Figure 41</label>
<caption><title><italic>Test accuracy versus network depth</italic> (Section <xref ref-type="sec" rid="s4_6_1">4.6.1</xref>), showing that test accuracy for this example increases monotonically with the network depth (number of layers). [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 196. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-41.tif"/>
</fig>
<p>But it is not clear where in [<xref ref-type="bibr" rid="ref-13">13</xref>] it was actually said that a network is &#x201C;deep&#x201D; if the number of hidden (state) layers is greater than three. An example in image recognition having more than three layers was, however, given in [<xref ref-type="bibr" rid="ref-13">13</xref>] (emphases are ours):</p>
<disp-quote><p>&#x201C;An image, for example, comes in the form of an array of pixel values, and the learned features in the <italic>first layer</italic> of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The <italic>second layer</italic> typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The <italic>third layer</italic> may assemble motifs into larger combinations that correspond to parts of familiar objects, and <italic>subsequent layers</italic> would detect objects as combinations of these parts.&#x201D;</p>
</disp-quote><p>But the above was not a criterion for a network to be considered as &#x201C;deep&#x201D;. The number of model parameters (weights and biases) and the size of the training dataset for a &#x201C;typical deep-learning system&#x201D; were further described in [<xref ref-type="bibr" rid="ref-13">13</xref>] as follows (emphases are ours):</p>
<disp-quote><p>&#x201C;In a typical <italic>deep-learning</italic> system, there may be <italic>hundreds of millions</italic> of these adjustable <italic>weights</italic>, and <italic>hundreds of millions</italic> of labelled examples with which to train the machine.&#x201D;</p>
</disp-quote><p>See Remark <xref ref-type="statement" rid="st7_2">7.2</xref> on recurrent neural networks (RNNs) as equivalent to &#x201C;very deep feedforward networks&#x201D;. Another example was also provided in [<xref ref-type="bibr" rid="ref-13">13</xref>]:</p>
<disp-quote><p>&#x201C;Recent ConvNet [convolutional neural network, or CNN]<xref ref-type="fn" rid="fn83"><sup>83</sup></xref><fn id="fn83"><label>83</label><p>A special type of deep network that went out of favor, and is now back in favor, among the computer-vision and machine-learning communities after the spectacular success that ConvNet garnered at the 2012 ImageNet competition; see [<xref ref-type="bibr" rid="ref-13">13</xref>] [<xref ref-type="bibr" rid="ref-75">75</xref>] [<xref ref-type="bibr" rid="ref-74">74</xref>]. Since we are reviewing in detail some specific applications of deep networks to computational mechanics, we will not review ConvNet here, but focus on MultiLayer Neural (MLN)&#x2014;also known as MultiLayer Perceptron (MLP)&#x2014;networks.</p></fn> architectures have 10 to 20 layers of ReLUs [rectified linear units], hundreds of millions of weights, and billions of connections between units.&#x201D;<xref ref-type="fn" rid="fn84"><sup>84</sup></xref><fn id="fn84"><label>84</label><p>A network processing &#x201C;unit&#x201D; is also called a &#x201C;neuron&#x201D;.</p></fn></p>
</disp-quote><p>A neural network with 160 billion parameters was perhaps the largest in 2015 [<xref ref-type="bibr" rid="ref-126">126</xref>]:</p>
<disp-quote><p>&#x201C;Digital Reasoning, a cognitive computing company based in Franklin, Tenn., recently announced that it has trained a neural network consisting of 160 billion parameters&#x2014;more than 10 times larger than previous neural networks.</p>
<p>The Digital Reasoning neural network easily surpassed previous records held by Google&#x2019;s 11.2-billion parameter system and Lawrence Livermore National Laboratory&#x2019;s 15-billion parameter system.&#x201D;</p>
</disp-quote><p>As mentioned above, for general network architectures (other than feedforward networks), not only is there no consensus on the definition of depth, but there is also no consensus on how much depth a network must have to qualify as being &#x201C;deep&#x201D;; see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 8, who offered the following intentionally vague definition:</p>
<disp-quote><p>&#x201C;Deep learning can be safely regarded as the study of models that involve a greater amount of composition of either learned functions or learned concepts than traditional machine learning does.&#x201D;</p>
</disp-quote><p>Figure <xref ref-type="fig" rid="fig-42">42</xref> depicts the increase in the number of neurons in neural networks over time, from 1958 (Network 1 by Rosenblatt (1958) [<xref ref-type="bibr" rid="ref-119">119</xref>] in Figure <xref ref-type="fig" rid="fig-42">42</xref> with one neuron, which was an error in [<xref ref-type="bibr" rid="ref-78">78</xref>], as discussed in Section <xref ref-type="sec" rid="s13_2">13.2</xref>) to 2014 (Network 20 GoogleNet with more than one million neurons), which was still far below the more than ten million biological neurons in a frog.</p></sec>
<sec id="s4_6_2"><label>4.6.2</label>
<title>Architecture</title>
<p>The architecture of a network is defined by the number of layers (depth), the layer width (number of neurons per layer), and the connections among the neurons.<xref ref-type="fn" rid="fn85"><sup>85</sup></xref><fn id="fn85"><label>85</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 166.</p></fn> We have seen the architecture of fully-connected feedforward neural networks above; see Figure <xref ref-type="fig" rid="fig-23">23</xref> and Figure <xref ref-type="fig" rid="fig-35">35</xref>.</p>
<p>One example of an architecture different from that of fully-connected feedforward networks is convolutional neural networks, which are based on the convolution integral (see Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>) in Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref> on &#x201C;Dynamic, time dependence, Volterra series&#x201D;), and which had proven to be successful long before deep-learning networks:</p>
<disp-quote><p>&#x201C;Convolutional networks were also some of the first neural networks to solve important commercial applications and remain at the forefront of commercial applications of deep learning today. By the end of the 1990s, this system deployed by NEC was reading over 10 percent of all the checks in the United States. Later, several OCR and handwriting recognition systems based on convolutional nets were deployed by Microsoft.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 360.</p>
</disp-quote>
<disp-quote><p>&#x201C;Fully-connected networks were believed not to work well. It may be that the primary barriers to the success of neural networks were psychological (practitioners did not expect neural networks to work, so they did not make a serious effort to use neural networks). Whatever the case, it is fortunate that convolutional networks performed well decades ago. In many ways, they carried the torch for the rest of deep learning and paved the way to the acceptance of neural networks in general.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 361.</p>
</disp-quote>
<fig id="fig-42">
<label>Figure 42</label>
<caption><title><italic>Increasing network size over time</italic> (Sections <xref ref-type="sec" rid="s4_6_1">4.6.1</xref>, <xref ref-type="sec" rid="s13_2">13.2</xref>). All networks before 2015 had their number of neurons smaller than that of a frog at <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:mn>1.6</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mn>7</mml:mn></mml:msup></mml:math></inline-formula>, and still far below that in a human brain at <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mn>8.6</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>; see &#x201C;List of animals by number of neurons&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=List_of_animals_by_number_of_neurons&amp;oldid=896223835">version 02:46, 9 May 2019</ext-link>. In [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 23, it was estimated that neural network size would double every 2.4 years (a clear parallel to Moore&#x2019;s law, which stated that the number of transistors on integrated circuits doubled every 2 years). Network 1 by Rosenblatt (1958 [<xref ref-type="bibr" rid="ref-119">119</xref>], 1962 [<xref ref-type="bibr" rid="ref-2">2</xref>]) was listed in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 23, as having one neuron (see figure above), which was incorrect, since Rosenblatt (1957) [<xref ref-type="bibr" rid="ref-1">1</xref>] conceived a network with 1000 neurons, and even built the Mark I computer to run this network; see Section <xref ref-type="sec" rid="s13_2">13.2</xref> and Figure <xref ref-type="fig" rid="fig-133">133</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-42.tif"/>
</fig>
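<p>To give a feel for the doubling estimate quoted in the caption of Figure <xref ref-type="fig" rid="fig-42">42</xref>, the following sketch extrapolates the network size naively. It assumes, purely for illustration, a starting size of one million neurons in 2014 (the order of magnitude quoted for Network 20) and an uninterrupted doubling every 2.4 years; the function name <italic>year_reached</italic> is hypothetical, and the result is an arithmetic exercise, not a forecast.</p>

```python
import math

neurons_2014 = 1.0e6        # assumed starting size: about one million neurons in 2014
neurons_frog = 1.6e7        # frog, from the caption of Figure 42
neurons_human = 8.6e10      # human brain, from the caption of Figure 42
doubling_years = 2.4        # doubling period estimated in [78], p. 23


def year_reached(target, start=neurons_2014, start_year=2014.0):
    """Year at which a size doubling every 2.4 years grows from start to target."""
    doublings = math.log2(target / start)
    return start_year + doubling_years * doublings


year_frog = year_reached(neurons_frog)
year_human = year_reached(neurons_human)
print(round(year_frog, 1), round(year_human, 1))
```

<p>Under these assumptions, four doublings (about ten years) would be needed to reach the neuron count of a frog, and roughly sixteen doublings (about four decades) to reach that of a human brain.</p>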
<p>Here, we present a more recent and successful network architecture different from the fully-connected feedforward network. The residual network was introduced in [<xref ref-type="bibr" rid="ref-127">127</xref>] to address the problem of vanishing gradient that plagued &#x201C;very deep&#x201D; networks with as few as 16 layers during training (see Section <xref ref-type="sec" rid="s5">5</xref> on Backpropagation) and the problem of increased training error and test error with increased network depth as shown in Figure <xref ref-type="fig" rid="fig-43">43</xref>.</p>
<statement id="st4_6"><title><xref ref-type="statement" rid="st4_6">Remark 4.6</xref>.</title>
<p><italic>Training error, test (generalization) error.</italic> Using a set of data, called training data, to find the parameters that minimize the loss function (i.e., doing the training) provides the training error, which is the least-squares error between the predicted outputs and the training data. Then running the optimally trained model on a different set of data, called test data, which has not been used for the training, provides the test error, also known as generalization error. More details can be found in Section <xref ref-type="sec" rid="s6">6</xref>, and in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 107.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement><p>The basic building block of a residual network is shown in Figure <xref ref-type="fig" rid="fig-44">44</xref>, and a full residual network in Figure <xref ref-type="fig" rid="fig-45">45</xref>. The rationale for residual networks was that, if the identity map were optimal, it would be easier for the optimization (training) process to drive the residual <inline-formula id="ieqn-443"><mml:math id="mml-ieqn-443"><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> down to zero than to fit the identity map with a stack of nonlinear layers; see [<xref ref-type="bibr" rid="ref-127">127</xref>], where it was mentioned that deep residual networks won first place in several image recognition competitions.</p>
<fig id="fig-43">
<label>Figure 43</label>
<caption><title><italic>Training/test error vs. iterations, depth</italic> (Sections <xref ref-type="sec" rid="s4_6_2">4.6.2</xref>, <xref ref-type="sec" rid="s6">6</xref>). The training error and test error of deep fully-connected networks increased when the number of layers (depth) increased [<xref ref-type="bibr" rid="ref-127">127</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-43.tif"/>
</fig>
<fig id="fig-44">
<label>Figure 44</label>
<caption><title><italic>Residual network</italic> (Sections <xref ref-type="sec" rid="s4_6_2">4.6.2</xref>, <xref ref-type="sec" rid="s6">6</xref>), basic building block having two layers with the rectified linear activation function (ReLU), for which the input is <inline-formula id="ieqn-440"><mml:math id="mml-ieqn-440"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula>, the output is <inline-formula id="ieqn-441"><mml:math id="mml-ieqn-441"><mml:mrow><mml:mi>&#x0210B;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula>, where the internal mapping function <inline-formula id="ieqn-442"><mml:math id="mml-ieqn-442"><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x0210B;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> is called the residual. Chaining this building block one after another forms a deep residual network; see Figure <xref ref-type="fig" rid="fig-45">45</xref> [<xref ref-type="bibr" rid="ref-127">127</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-44.tif"/>
</fig>
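<p>The building block in Figure <xref ref-type="fig" rid="fig-44">44</xref> can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation of [<xref ref-type="bibr" rid="ref-127">127</xref>]: the two layers are taken as fully-connected (rather than convolutional), the input and output have the same dimension, the ReLU applied after the addition in [<xref ref-type="bibr" rid="ref-127">127</xref>] is omitted so that the output matches the expression in the caption, and the names <italic>residual_block</italic>, <italic>W1</italic>, <italic>b1</italic>, <italic>W2</italic>, <italic>b2</italic> are hypothetical.</p>

```python
import numpy as np


def relu(z):
    """Rectified linear activation function."""
    return np.maximum(z, 0.0)


def residual_block(x, W1, b1, W2, b2):
    """Two-layer block: residual F(x) plus the identity shortcut x."""
    F = W2 @ relu(W1 @ x + b1) + b2   # internal mapping F(x), the residual
    return F + x                      # output H(x) = F(x) + x


rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)

# Zero weights and biases give F(x) = 0, so the block reduces to the identity map.
zeros_W, zeros_b = np.zeros((n, n)), np.zeros(n)
h_identity = residual_block(x, zeros_W, zeros_b, zeros_W, zeros_b)

# Chaining several blocks one after another forms a deeper residual network.
params = [(0.1 * rng.standard_normal((n, n)), np.zeros(n),
           0.1 * rng.standard_normal((n, n)), np.zeros(n)) for _ in range(3)]
h = x
for W1, b1, W2, b2 in params:
    h = residual_block(h, W1, b1, W2, b2)

print(h_identity, h)
```

<p>The zero-parameter case illustrates the rationale: the identity map is obtained by driving the residual to zero, without having to fit the identity through the nonlinear layers themselves.</p>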
<statement id="st4_7"><title><xref ref-type="statement" rid="st4_7">Remark 4.7</xref>.</title>
<p>The identity map that jumps over a number of layers in the residual network building block in Figure <xref ref-type="fig" rid="fig-44">44</xref> and in the full residual network in Figure <xref ref-type="fig" rid="fig-45">45</xref> is based on a concept close to that for the path of the cell state <inline-formula id="ieqn-444"><mml:math id="mml-ieqn-444"><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in the Long Short-Term Memory (LSTM) unit for recurrent neural networks (RNN), as described in Figure <xref ref-type="fig" rid="fig-81">81</xref> in Section <xref ref-type="sec" rid="s7_2">7.2</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement><p>A deep residual network with more than 1,200 layers was proposed in [<xref ref-type="bibr" rid="ref-128">128</xref>]. A wide residual-network architecture that outperformed deep and thin networks was proposed in [<xref ref-type="bibr" rid="ref-129">129</xref>]: &#x201C;For instance, [their] wide 16-layer deep network has the same accuracy as a 1000-layer thin deep network and a comparable number of parameters, although being several times faster to train.&#x201D;</p>
<p>It is still not clear why some architectures worked well, while others did not:</p>
<disp-quote><p>&#x201C;The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 186.</p></disp-quote>
<fig id="fig-45">
<label>Figure 45</label>
<caption><title><italic>Full residual network</italic> (Sections <xref ref-type="sec" rid="s4_6_2">4.6.2</xref>, <xref ref-type="sec" rid="s6">6</xref>) with 34 layers, made up from 16 building blocks with two layers each (Figure <xref ref-type="fig" rid="fig-44">44</xref>), together with an input and an output layer. This residual network has a total of 3.6 billion floating-point operations (FLOPs with fused multiply-add operations), which could be considered as the network &#x201C;computational depth&#x201D; [<xref ref-type="bibr" rid="ref-127">127</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-45.tif"/>
</fig>
</sec></sec></sec>
<sec id="s5"><label>5</label>
<title>Backpropagation</title>
<p>Backpropagation, sometimes abbreviated as &#x201C;backprop&#x201D;, was a child of whom many could claim to be the father, and is used to compute the gradient of the cost function with respect to the parameters (weights and biases); see Section <xref ref-type="sec" rid="s13_4_1">13.4.1</xref> for a history of backpropagation. This gradient is subsequently used in an optimization process, usually the Stochastic Gradient Descent method, to find the parameters that minimize the cost or loss function.</p>
<sec id="s5_1"><label>5.1</label>
<title>Cost (loss, error) function</title>
<p>Two types of cost function are discussed here: (1) the mean squared error (MSE), and (2) the maximum likelihood (probability cost).<xref ref-type="fn" rid="fn86"><sup>86</sup></xref><fn id="fn86"><label>86</label><p>For other types of loss function, see, e.g., (1) Section &#x201C;Loss functions&#x201D; in &#x201C;torch.nn&#x2014;PyTorch Master Documentation&#x201D; (<ext-link ext-link-type="uri" xlink:href="https://pytorch.org/docs/stable/nn.html">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20191031180518/https://pytorch.org/docs/stable/nn.html">Internet archive</ext-link>), and (2) Jah 2019, A Brief Overview of Loss Functions in Pytorch (<ext-link ext-link-type="uri" xlink:href="https://medium.com/udacity-pytorch-challengers/a-brief-overview-of-loss-functions-in-pytorch-c0ddb78068f7">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20191118171435/https://medium.com/udacity-pytorch-challengers/a-brief-overview-of-loss-functions-in-pytorch-c0ddb78068f7">Internet archive</ext-link>).</p></fn></p>
<sec id="s5_1_1"><label>5.1.1</label>
<title>Mean squared error</title>
<p>For a given input <inline-formula id="ieqn-445"><mml:math id="mml-ieqn-445"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> (a single example) and target output <inline-formula id="ieqn-446"><mml:math id="mml-ieqn-446"><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>m</mml:mi></mml:msup></mml:math></inline-formula>, the cost of the predicted output <inline-formula id="ieqn-447"><mml:math id="mml-ieqn-447"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>m</mml:mi></mml:msup></mml:math></inline-formula>, for use in a least-squares problem, is defined as half the squared error (SE):</p>
<p><disp-formula id="eqn-62"><label>(62)</label><mml:math id="mml-eqn-62" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>SE</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The factor <inline-formula id="ieqn-448"><mml:math id="mml-ieqn-448"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula> is for the convenience of avoiding to carry the factor 2 when taking the gradient of the cost (or loss) function <inline-formula id="ieqn-450"><mml:math id="mml-ieqn-450"><mml:mi>J</mml:mi></mml:math></inline-formula>.<xref ref-type="fn" rid="fn87"><sup>87</sup></xref><fn id="fn87"><label>87</label><p>There is an inconsistent use of notation in [<xref ref-type="bibr" rid="ref-78">78</xref>] that could cause confusion, e.g., in [<xref ref-type="bibr" rid="ref-78">78</xref>], Chap. 5, p. 104, Eq. (5.4), the notation <inline-formula id="ieqn-3078"><mml:math id="mml-ieqn-3078"><mml:mrow><mml:mi mathvariant="bold-italic">&#x0177;</mml:mi></mml:mrow></mml:math></inline-formula> (with the hat) was defined as the network outputs, i.e., predicted values, with <inline-formula id="ieqn-3079"><mml:math id="mml-ieqn-3079"><mml:mrow><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:mrow></mml:math></inline-formula> (without the hat) as target values, whereas later in Chap. 6, p. 163, the notation <inline-formula id="ieqn-3080"><mml:math id="mml-ieqn-3080"><mml:mrow><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:mrow></mml:math></inline-formula> (without the hat) was used for the network outputs. Also, in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 105, the cost function was defined as the mean squared error, without the factor <inline-formula id="ieqn-3081"><mml:math id="mml-ieqn-3081"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula>. See also Footnote <xref ref-type="fn" rid="fn52">52</xref>.</p></fn></p>
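<p>The role of the factor one-half in Eq. (<xref ref-type="disp-formula" rid="eqn-62">62</xref>) can be checked with a short numerical sketch. The target and predicted outputs below are hypothetical values chosen for illustration, as are the function names; the gradient is taken with respect to the predicted output only (the gradient with respect to the parameters then follows from the chain rule), and a central finite difference serves as an independent check.</p>

```python
import numpy as np


def half_squared_error(y, y_tilde):
    """Cost J = (1/2) * ||y - y_tilde||^2 for a single example, as in Eq. (62)."""
    diff = y - y_tilde
    return 0.5 * float(diff @ diff)


def grad_wrt_prediction(y, y_tilde):
    """Gradient of J with respect to y_tilde; the 1/2 cancels the factor 2."""
    return y_tilde - y


y = np.array([1.0, 0.0, 2.0])          # hypothetical target output
y_tilde = np.array([0.5, 0.5, 1.0])    # hypothetical predicted output

J = half_squared_error(y, y_tilde)
g = grad_wrt_prediction(y, y_tilde)

# Central finite differences as an independent check of the gradient.
eps = 1.0e-6
g_fd = np.array([
    (half_squared_error(y, y_tilde + eps * e)
     - half_squared_error(y, y_tilde - eps * e)) / (2.0 * eps)
    for e in np.eye(3)
])
print(J, g, g_fd)
```

<p>Without the factor one-half, the same gradient would read twice the difference between the predicted and target outputs; the factor only rescales the cost and does not change the location of its minimum.</p>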
<p>While the components <inline-formula id="ieqn-451"><mml:math id="mml-ieqn-451"><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> of the output matrix <inline-formula id="ieqn-452"><mml:math id="mml-ieqn-452"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula> cannot be <italic>independent and identically distributed</italic> (<italic>i.i.d</italic>.), since <inline-formula id="ieqn-453"><mml:math id="mml-ieqn-453"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula> must represent a recognizable pattern (e.g., an image), in the case of training with <inline-formula id="ieqn-454"><mml:math id="mml-ieqn-454"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> examples as inputs:<xref ref-type="fn" rid="fn88"><sup>88</sup></xref><fn id="fn88"><label>88</label><p>In our notation, <inline-formula id="ieqn-3082"><mml:math id="mml-ieqn-3082"><mml:mi>m</mml:mi></mml:math></inline-formula> is the dimension of the output array <inline-formula id="ieqn-3083"><mml:math id="mml-ieqn-3083"><mml:mrow><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:mrow></mml:math></inline-formula>, whereas &#x1D5C6; (in a different font) is here the number of examples in Eq. (<xref ref-type="disp-formula" rid="eqn-63">63</xref>), and later represents the minibatch size in Eqs. (<xref ref-type="disp-formula" rid="eqn-136">136</xref>)-(<xref ref-type="disp-formula" rid="eqn-138">138</xref>). The size of the whole training set, called the &#x201C;full batch&#x201D; (Footnote <xref ref-type="fn" rid="fn117">117</xref>), is denoted by &#x1D5AC; (Footnote <xref ref-type="fn" rid="fn144">144</xref>).</p></fn></p>
<p><disp-formula id="eqn-63"><label>(63)</label><mml:math id="mml-eqn-63" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">Y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-455"><mml:math id="mml-ieqn-455"><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow></mml:math></inline-formula> is the set of <inline-formula id="ieqn-456"><mml:math id="mml-ieqn-456"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> examples, and <inline-formula id="ieqn-457"><mml:math id="mml-ieqn-457"><mml:mrow><mml:mi mathvariant="double-struck">Y</mml:mi></mml:mrow></mml:math></inline-formula> the set of the corresponding outputs, the examples <inline-formula id="ieqn-458"><mml:math id="mml-ieqn-458"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> can be <italic>i.i.d</italic>., and the half MSE cost function for these outputs is half the expectation of the SE:</p>
<p><disp-formula id="eqn-64"><label>(64)</label><mml:math id="mml-eqn-64" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>MSE</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>S</mml:mi><mml:msup><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mo>&#x2225;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo 
stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
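<p>The two cost functions in Eq. (<xref ref-type="disp-formula" rid="eqn-62">62</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-64">64</xref>) can be illustrated with a minimal Python sketch. The function names <monospace>half_se</monospace> and <monospace>half_mse</monospace> and the numerical data are hypothetical, chosen only for illustration; they do not come from any particular library.</p>
<preformat preformat-type="code">
```python
def half_se(y, y_pred):
    """Half squared error of Eq. (62) for a single example."""
    return 0.5 * sum((yk - pk) ** 2 for yk, pk in zip(y, y_pred))


def half_mse(targets, predictions):
    """Half mean squared error of Eq. (64): average of the half SE
    over all examples in the set."""
    n_examples = len(targets)
    return sum(half_se(y, p) for y, p in zip(targets, predictions)) / n_examples


# Hypothetical data: two examples, each with m = 2 output components.
targets = [[1.0, 0.0], [0.0, 1.0]]
predictions = [[0.5, 0.5], [0.0, 0.0]]

cost = half_mse(targets, predictions)
print(cost)
```
</preformat>
<p>For these data the half SE of the first example is 0.25 and that of the second example is 0.5, so the half MSE over the two examples is 0.375.</p>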
</sec>
<sec id="s5_1_2"><label>5.1.2</label>
<title>Maximum likelihood (probability cost)</title>
<p>Many (if not most) modern networks employ a probability cost function based on the principle of maximum likelihood, which has the form of a negative log-likelihood, describing the cross-entropy between the training data with probability distribution <inline-formula id="ieqn-459"><mml:math id="mml-ieqn-459"><mml:msub><mml:mrow><mml:mover><mml:mi>p</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the model with probability distribution <inline-formula id="ieqn-460"><mml:math id="mml-ieqn-460"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> ([<xref ref-type="bibr" rid="ref-78">78</xref>], p. 173):</p>
<p><disp-formula id="eqn-65"><label>(65)</label><mml:math id="mml-eqn-65" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold">x</mml:mi></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold">y</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>p</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-461"><mml:math id="mml-ieqn-461"><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:math></inline-formula> is the expectation; <inline-formula id="ieqn-462"><mml:math id="mml-ieqn-462"><mml:mrow><mml:mi mathvariant="bold">x</mml:mi></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-463"><mml:math id="mml-ieqn-463"><mml:mrow><mml:mi mathvariant="bold">y</mml:mi></mml:mrow></mml:math></inline-formula> are random variables for training data with distribution <inline-formula id="ieqn-464"><mml:math id="mml-ieqn-464"><mml:msub><mml:mrow><mml:mover><mml:mi>p</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>; the inputs <inline-formula id="ieqn-465"><mml:math id="mml-ieqn-465"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> and the target outputs <inline-formula id="ieqn-466"><mml:math id="mml-ieqn-466"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula> are values of <inline-formula id="ieqn-467"><mml:math id="mml-ieqn-467"><mml:mrow><mml:mi mathvariant="bold">x</mml:mi></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-468"><mml:math id="mml-ieqn-468"><mml:mrow><mml:mi mathvariant="bold">y</mml:mi></mml:mrow></mml:math></inline-formula>, respectively, and <inline-formula id="ieqn-469"><mml:math id="mml-ieqn-469"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the conditional probability of the distribution of the target outputs <inline-formula id="ieqn-470"><mml:math id="mml-ieqn-470"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula> given the inputs <inline-formula id="ieqn-471"><mml:math id="mml-ieqn-471"><mml:mi 
mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> and the parameters <inline-formula id="ieqn-472"><mml:math id="mml-ieqn-472"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula>, with the predicted outputs <inline-formula id="ieqn-473"><mml:math id="mml-ieqn-473"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> given by the model <inline-formula id="ieqn-474"><mml:math id="mml-ieqn-474"><mml:mi>f</mml:mi></mml:math></inline-formula> (neural network), having as arguments the inputs <inline-formula id="ieqn-475"><mml:math id="mml-ieqn-475"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> and the parameters <inline-formula id="ieqn-476"><mml:math id="mml-ieqn-476"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-66"><label>(66)</label><mml:math id="mml-eqn-66" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x21D4;</mml:mo><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
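<p>As a minimal sketch of Eq. (<xref ref-type="disp-formula" rid="eqn-65">65</xref>), the expectation over the training data can be replaced by an average over a finite set of examples. The function <monospace>toy_model</monospace> below is a hypothetical stand-in for the model distribution (a fixed table of class probabilities, not a neural network), and the three examples are invented for illustration.</p>
<preformat preformat-type="code">
```python
import math


def nll_cost(examples, model):
    """Empirical version of Eq. (65): mean negative log-probability that
    the model assigns to the observed target y, given the input x."""
    total = 0.0
    for x, y in examples:
        total -= math.log(model(x)[y])
    return total / len(examples)


def toy_model(x):
    # Hypothetical stand-in for p_model(y | x; theta): a list of class
    # probabilities for the two classes y = 0 and y = 1.
    return [0.8, 0.2] if x == 0 else [0.3, 0.7]


# Hypothetical (input, target) pairs.
examples = [(0, 0), (1, 1), (0, 1)]

cost = nll_cost(examples, toy_model)
print(cost)
```
</preformat>
<p>The third example, to which the toy model assigns a probability of only 0.2, contributes the largest term to the cost.</p>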
<p>The expectations of a function <inline-formula id="ieqn-477"><mml:math id="mml-ieqn-477"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of a random variable <inline-formula id="ieqn-478"><mml:math id="mml-ieqn-478"><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:math></inline-formula>, having a probability distribution <inline-formula id="ieqn-479"><mml:math id="mml-ieqn-479"><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for the discrete case, and the probability density <inline-formula id="ieqn-480"><mml:math id="mml-ieqn-480"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for the continuous case, are respectively<xref ref-type="fn" rid="fn89"><sup>89</sup></xref><fn id="fn89"><label>89</label><p>The simplified notation <inline-formula id="ieqn-3086"><mml:math id="mml-ieqn-3086"><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula> for expectation <inline-formula id="ieqn-3087"><mml:math id="mml-ieqn-3087"><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with implied probability distribution, is used in Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref> on step-length decay and simulated annealing (Remark <xref ref-type="statement" rid="st6_9">6.9</xref>, Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>) as an add-on improvement to the 
stochastic gradient descent algorithm.</p></fn></p>
<p><disp-formula id="eqn-67"><label>(67)</label><mml:math id="mml-eqn-67" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold">x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>x</mml:mi></mml:munder><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:msub><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold">x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x222B;</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" 
stretchy="false">&#x27E8;</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
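<p>Both expectations in Eq. (<xref ref-type="disp-formula" rid="eqn-67">67</xref>) can be sketched numerically in a few lines of Python. The fair die, the uniform density on the interval from 0 to 1, and the midpoint rule used to approximate the integral are assumptions made here only for illustration; the function names are hypothetical.</p>
<preformat preformat-type="code">
```python
def expectation_discrete(g, pmf):
    """Discrete case of Eq. (67): sum of g(x) P(x) over all values x."""
    return sum(g(x) * p for x, p in pmf.items())


def expectation_continuous(g, pdf, a, b, n=10000):
    """Continuous case of Eq. (67): midpoint-rule approximation of the
    integral of g(x) p(x) dx over the interval [a, b]."""
    h = (b - a) / n
    midpoints = (a + (i + 0.5) * h for i in range(n))
    return h * sum(g(x) * pdf(x) for x in midpoints)


def uniform_pdf(x):
    # Density of the uniform distribution on [0, 1] (assumed example).
    return 1.0


# Expected face value of a fair six-sided die.
fair_die = {face: 1.0 / 6.0 for face in range(1, 7)}
mean_die = expectation_discrete(lambda x: x, fair_die)

# Expected value of x squared under the uniform density on [0, 1].
mean_sq = expectation_continuous(lambda x: x * x, uniform_pdf, 0.0, 1.0)

print(mean_die, mean_sq)
```
</preformat>
<p>The first result is 3.5 up to rounding, and the second approximates the exact value 1/3.</p>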
 
<statement id="st5_1"><title><xref ref-type="statement" rid="st5_1">Remark 5.1</xref>.</title>
<p><italic>Information content, Shannon entropy, maximum likelihood</italic>. The expression in Eq. (<xref ref-type="disp-formula" rid="eqn-65">65</xref>), with the minus sign and the log function, can seem abstract to readers not familiar with the probability concept of maximum likelihood, which is related to the concepts of information content and Shannon entropy. First, an event <inline-formula id="ieqn-481"><mml:math id="mml-ieqn-481"><mml:mi>x</mml:mi></mml:math></inline-formula> with low probability (e.g., an asteroid will hit the Earth tomorrow) would have higher information content than an event with high probability (e.g., the sun will rise tomorrow morning). Since the probability of <inline-formula id="ieqn-482"><mml:math id="mml-ieqn-482"><mml:mi>x</mml:mi></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-483"><mml:math id="mml-ieqn-483"><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, is between 0 and 1, the negative of the logarithm of <inline-formula id="ieqn-484"><mml:math id="mml-ieqn-484"><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e.,</p>
<p><disp-formula id="eqn-68"><label>(68)</label><mml:math id="mml-eqn-68" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>called the information content of <inline-formula id="ieqn-485"><mml:math id="mml-ieqn-485"><mml:mi>x</mml:mi></mml:math></inline-formula>, is large when the probability is near zero, and small when the probability is near 1. In addition, the probability that two independent events both occur is the product of the probabilities of these events, e.g., the probability of having two heads in two coin tosses is</p>
<p><disp-formula id="eqn-69"><label>(69)</label><mml:math id="mml-eqn-69" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>head</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>head</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>head</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>head</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>4</mml:mn></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The product (chain) rule of conditional probabilities consists of expressing a joint probability of several random variables <inline-formula id="ieqn-486"><mml:math id="mml-ieqn-486"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> as the product<xref ref-type="fn" rid="fn90"><sup>90</sup></xref><fn id="fn90"><label>90</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 57. 
The notation <inline-formula id="ieqn-3088"><mml:math id="mml-ieqn-3088"><mml:msup><mml:mtext>x</mml:mtext><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> (with vertical bars enclosing the superscript <inline-formula id="ieqn-3089"><mml:math id="mml-ieqn-3089"><mml:mi>k</mml:mi></mml:math></inline-formula>) is used to designate example <inline-formula id="ieqn-3090"><mml:math id="mml-ieqn-3090"><mml:mi>k</mml:mi></mml:math></inline-formula> in the set <inline-formula id="ieqn-3091"><mml:math id="mml-ieqn-3091"><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow></mml:math></inline-formula> of examples in Eq. (<xref ref-type="disp-formula" rid="eqn-73">73</xref>), instead of the notation <inline-formula id="ieqn-3092"><mml:math id="mml-ieqn-3092"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> (with parentheses), since the parentheses were already used to surround the layer number <inline-formula id="ieqn-3093"><mml:math id="mml-ieqn-3093"><mml:mi>k</mml:mi></mml:math></inline-formula>, as in Figure <xref ref-type="fig" rid="fig-35">35</xref>.</p></fn></p>
<p><disp-formula id="eqn-70"><label>(70)</label><mml:math id="mml-eqn-70" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo 
stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The logarithm of the products in Eq. (<xref ref-type="disp-formula" rid="eqn-69">69</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-70">70</xref>) is the sum of the logarithms of the factor probabilities, which provides another reason to use the logarithm in the expression for information content in Eq. (<xref ref-type="disp-formula" rid="eqn-68">68</xref>): Independent events have additive information. Concretely, the information content of two asteroids independently hitting the Earth should be double that of one asteroid hitting the Earth.</p>
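<p>The additivity of information for independent events can be checked with the coin-toss example of Eq. (<xref ref-type="disp-formula" rid="eqn-69">69</xref>). The sketch below assumes the natural logarithm in Eq. (<xref ref-type="disp-formula" rid="eqn-68">68</xref>), so that information is measured in nats; the function name <monospace>information_content</monospace> is hypothetical.</p>
<preformat preformat-type="code">
```python
import math


def information_content(p):
    """I(x) = -log P(x) of Eq. (68), with the natural logarithm
    (an assumption of this sketch), i.e., measured in nats."""
    return -math.log(p)


p_head = 0.5
p_two_heads = p_head * p_head      # Eq. (69): two independent tosses

info_one = information_content(p_head)
info_two = information_content(p_two_heads)

# A near-certain event carries little information, a rare event much more.
info_likely = information_content(0.999)
info_rare = information_content(1.0e-9)

print(info_one, info_two, info_likely, info_rare)
```
</preformat>
<p>Here <monospace>info_two</monospace> equals twice <monospace>info_one</monospace> up to rounding, in agreement with the statement that independent events have additive information.</p>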
<p>The parameters <inline-formula id="ieqn-487"><mml:math id="mml-ieqn-487"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> that minimize the probability cost <inline-formula id="ieqn-488"><mml:math id="mml-ieqn-488"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-65">65</xref>) can be expressed as<xref ref-type="fn" rid="fn91"><sup>91</sup></xref><fn id="fn91"><label>91</label><p>A tilde is put on top of <inline-formula id="ieqn-3094"><mml:math id="mml-ieqn-3094"><mml:mrow><mml:mi mathvariant='bold-italic'>&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> to indicate that the matrix <inline-formula id="ieqn-3095"><mml:math id="mml-ieqn-3095"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> contains the estimated values of the parameters (weights and biases), called the estimates, not the true parameters. Recall from Footnote <xref ref-type="fn" rid="fn87">87</xref> that [<xref ref-type="bibr" rid="ref-78">78</xref>] used an overhead &#x201C;hat&#x201D; (<inline-formula id="ieqn-3096"><mml:math id="mml-ieqn-3096"><mml:mrow><mml:mover><mml:mo>&#x22C5;</mml:mo><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>) to indicate predicted value; see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 
120, where <inline-formula id="ieqn-3097"><mml:math id="mml-ieqn-3097"><mml:mrow><mml:mi mathvariant='bold-italic'>&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> is defined as the true parameters, and <inline-formula id="ieqn-3098"><mml:math id="mml-ieqn-3098"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> the predicted (or estimated) parameters.</p></fn></p>
<p><disp-formula id="eqn-71"><label>(71)</label><mml:math id="mml-eqn-71" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:munder><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold">x</mml:mi></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold">y</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>p</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:munder><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-72"><label>(72)</label><mml:math id="mml-eqn-72" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:munder><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">Y</mml:mi></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-73"><label>(73)</label><mml:math id="mml-eqn-73" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>with&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">Y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-489"><mml:math id="mml-ieqn-489"><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow></mml:math></inline-formula> is the set of <inline-formula id="ieqn-490"><mml:math id="mml-ieqn-490"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> examples that are <italic>independent and identically distributed</italic> (<italic>i.i.d</italic>.), and <inline-formula id="ieqn-491"><mml:math id="mml-ieqn-491"><mml:mrow><mml:mi mathvariant="double-struck">Y</mml:mi></mml:mrow></mml:math></inline-formula> the set of the corresponding outputs. The final form in Eq. (<xref ref-type="disp-formula" rid="eqn-72">72</xref>), i.e.,</p>
<p><disp-formula id="eqn-74"><label>(74)</label><mml:math id="mml-eqn-74" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">Y</mml:mi></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>is called the <italic>Principle of Maximum Likelihood</italic>, in which the model parameters are optimized to maximize the likelihood of reproducing the empirical data.<xref ref-type="fn" rid="fn92"><sup>92</sup></xref><fn id="fn92"><label>92</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 128.</p></fn>&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement> 
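<p>As a small numerical illustration of Eqs. (71) to (74), the following sketch checks on a toy problem that maximizing the mean log-likelihood, maximizing the joint likelihood, and minimizing the negative mean log-likelihood all select the same parameter. The data, the unconditional Bernoulli model (no input is involved), the grid search, and all variable names are assumptions made only for this illustration; they do not come from the references cited above.</p>
<preformat preformat-type="code">
```python
import numpy as np

# Hypothetical toy data: binary outcomes, modeled as i.i.d. Bernoulli(theta).
y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1], dtype=float)

# Candidate parameter values; a grid search stands in for a real optimizer.
thetas = np.linspace(0.01, 0.99, 99)

# Per-example model probability p_model(y_k; theta),
# arranged with shape (number of thetas, number of examples).
p = thetas[:, None] ** y[None, :] * (1.0 - thetas[:, None]) ** (1.0 - y[None, :])

# (1/m) * sum_k log p_model, i.e., the empirical expectation in Eq. (71).
mean_log_lik = np.log(p).mean(axis=1)

# prod_k p_model, i.e., the joint likelihood of the whole data set in Eq. (74).
joint_lik = p.prod(axis=1)

theta_from_log = thetas[np.argmax(mean_log_lik)]
theta_from_joint = thetas[np.argmax(joint_lik)]
theta_from_nll = thetas[np.argmin(-mean_log_lik)]

print(theta_from_log, theta_from_joint, theta_from_nll, y.mean())
```
</preformat>
<p>Because the logarithm is strictly increasing, the three selections coincide; for a Bernoulli model they also agree, up to the grid resolution, with the sample mean of the data.</p>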
<statement id="st5_2"><title><xref ref-type="statement" rid="st5_2">Remark 5.2</xref>.</title>
<p><italic>Relation between Mean Squared Error and Maximum Likelihood</italic>. The MSE is a particular case of the Maximum Likelihood. Consider having <inline-formula id="ieqn-492"><mml:math id="mml-ieqn-492"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> examples <inline-formula id="ieqn-493"><mml:math id="mml-ieqn-493"><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> that are independent and identically distributed (<italic>i.i.d</italic>.), as in Eq. (<xref ref-type="disp-formula" rid="eqn-63">63</xref>). 
If the model probability <inline-formula id="ieqn-494"><mml:math id="mml-ieqn-494"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> has a normal distribution, with the predicted output</p>
<p><disp-formula id="eqn-75"><label>(75)</label><mml:math id="mml-eqn-75" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>as in Eq. (<xref ref-type="disp-formula" rid="eqn-66">66</xref>), predicting the mean of this normal distribution,<xref ref-type="fn" rid="fn93"><sup>93</sup></xref><fn id="fn93"><label>93</label><p>The normal (Gaussian) distribution of scalar random variable <inline-formula id="ieqn-3099"><mml:math id="mml-ieqn-3099"><mml:mi>x</mml:mi></mml:math></inline-formula>, mean <inline-formula id="ieqn-3100"><mml:math id="mml-ieqn-3100"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula>, and variance <inline-formula id="ieqn-3101"><mml:math id="mml-ieqn-3101"><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula> is written as <inline-formula id="ieqn-3102"><mml:math id="mml-ieqn-3102"><mml:mrow><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>; see, e.g., [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 24.</p></fn> then</p>
<p><disp-formula id="eqn-76"><label>(76)</label><mml:math id="mml-eqn-76" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo 
stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mrow><mml:msup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mo stretchy="false">&#x2225;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" 
/></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-495"><mml:math id="mml-ieqn-495"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> designating the standard deviation; that is, the error between the target output <inline-formula id="ieqn-496"><mml:math id="mml-ieqn-496"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> and the predicted output <inline-formula id="ieqn-497"><mml:math id="mml-ieqn-497"><mml:mrow><mml:msup><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is normally distributed. Taking the logarithm of <inline-formula id="ieqn-498"><mml:math id="mml-ieqn-498"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, we have</p>
<p><disp-formula id="eqn-77"><label>(77)</label><mml:math id="mml-eqn-77" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mo stretchy="false">&#x2225;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msup><mml:mover 
accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Then summing Eq. (<xref ref-type="disp-formula" rid="eqn-77">77</xref>) over all examples <inline-formula id="ieqn-499"><mml:math id="mml-ieqn-499"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> as in the last expression in Eq. (<xref ref-type="disp-formula" rid="eqn-71">71</xref>) yields</p>
<p><disp-formula id="eqn-78"><label>(78)</label><mml:math id="mml-eqn-78" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo 
stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mo stretchy="false">&#x2225;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and thus the minimizer <inline-formula id="ieqn-449"><mml:math id="mml-ieqn-449"><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-71">71</xref>) can be written as</p>
<p><disp-formula id="eqn-79"><label>(79)</label><mml:math id="mml-eqn-79" display="block"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:munder><mml:mrow><mml:mi>max</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:munder><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mo stretchy="false">&#x2225;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:munder><mml:mrow><mml:mi>min</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:munder><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mo stretchy="false">&#x2225;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo 
stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:munder><mml:mrow><mml:mi>min</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:munder><mml:mspace width="thinmathspace" /><mml:mi>J</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where the MSE cost function <inline-formula id="ieqn-501"><mml:math id="mml-ieqn-501"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> was defined in Eq. (<xref ref-type="disp-formula" rid="eqn-64">64</xref>), noting that the first term in Eq. (<xref ref-type="disp-formula" rid="eqn-78">78</xref>) does not depend on the parameters, and that positive constant factors such as <inline-formula id="ieqn-502"><mml:math id="mml-ieqn-502"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> or <inline-formula id="ieqn-503"><mml:math id="mml-ieqn-503"><mml:mn>2</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> do not affect the value of the minimizer <inline-formula id="ieqn-504"><mml:math id="mml-ieqn-504"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>.</p>
<p>Thus finding the minimizer of the maximum likelihood cost function in Eq. (<xref ref-type="disp-formula" rid="eqn-65">65</xref>) is the same as finding the minimizer of the MSE in Eq. (<xref ref-type="disp-formula" rid="eqn-62">62</xref>); see also [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 130.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
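<p>The equivalence stated in Remark <xref ref-type="statement" rid="st5_2">5.2</xref> can be checked numerically with a minimal sketch. It assumes a hypothetical scalar linear model, synthetic data with Gaussian noise of fixed and known standard deviation, and a grid search in place of a real optimizer; none of these choices, nor the variable names, come from the cited references.</p>
<preformat preformat-type="code">
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: predicted output f(x, theta) = theta * x, with additive
# Gaussian noise of fixed, known standard deviation sigma (an assumption).
sigma = 0.3
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.5 * x + sigma * rng.normal(size=200)

# Candidate parameter values; a grid search stands in for a real optimizer.
thetas = np.linspace(0.0, 3.0, 601)

# Squared errors, arranged with shape (number of thetas, number of examples).
residual_sq = (y[None, :] - thetas[:, None] * x[None, :]) ** 2

# Mean squared error for each candidate theta.
mse = residual_sq.mean(axis=1)

# Negative log-likelihood under N(y; f(x, theta), sigma^2), summed over all
# examples: a theta-independent constant plus the scaled sum of squared errors.
nll = (0.5 * y.size * np.log(2.0 * np.pi * sigma**2)
       + residual_sq.sum(axis=1) / (2.0 * sigma**2))

theta_mse = thetas[np.argmin(mse)]
theta_mle = thetas[np.argmin(nll)]

# Closed-form least-squares solution for this one-parameter model, for reference.
theta_ls = (x @ y) / (x @ x)

print(theta_mse, theta_mle, theta_ls)
```
</preformat>
<p>The two grid minimizers coincide because the negative log-likelihood differs from the MSE only by a positive scale factor and an additive constant, both independent of the parameter; each lies within half a grid step of the closed-form least-squares value.</p>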
<p>Remark <xref ref-type="statement" rid="st5_2">5.2</xref> justifies the use of Mean Squared Error as <italic>a</italic> Maximum Likelihood estimator.<xref ref-type="fn" rid="fn94"><sup>94</sup></xref><fn id="fn94"><label>94</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 130.</p></fn> For the purpose of this review paper, it is sufficient to use the MSE cost function in Eq. (<xref ref-type="disp-formula" rid="eqn-42">42</xref>) to develop the backpropagation procedure.</p></sec>
<sec id="s5_1_3"><label>5.1.3</label>
<title>Classification loss function</title>
<p>In classification tasks (such as those used in [<xref ref-type="bibr" rid="ref-38">38</xref>], Section <xref ref-type="sec" rid="s10_2">10.2</xref>, and Footnote <xref ref-type="fn" rid="fn265">265</xref>), a neural network is trained to predict which of <inline-formula id="ieqn-505"><mml:math id="mml-ieqn-505"><mml:mi>k</mml:mi></mml:math></inline-formula> different classes (categories) an input <inline-formula id="ieqn-506"><mml:math id="mml-ieqn-506"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> belongs to. The simplest classification problem has only two classes (<inline-formula id="ieqn-507"><mml:math id="mml-ieqn-507"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>), which can be represented by the values <inline-formula id="ieqn-508"><mml:math id="mml-ieqn-508"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> of a single binary variable <inline-formula id="ieqn-509"><mml:math id="mml-ieqn-509"><mml:mi>y</mml:mi></mml:math></inline-formula>. The probability distribution of such a single Boolean-valued variable is called the <italic>Bernoulli distribution</italic>.<xref ref-type="fn" rid="fn95"><sup>95</sup></xref><fn id="fn95"><label>95</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 
68.</p></fn> The Bernoulli distribution is characterized by a single parameter <inline-formula id="ieqn-510"><mml:math id="mml-ieqn-510"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e., the conditional probability of <inline-formula id="ieqn-511"><mml:math id="mml-ieqn-511"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> belonging to the class <inline-formula id="ieqn-512"><mml:math id="mml-ieqn-512"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. To perform binary classification, a neural network is therefore trained to estimate the conditional probability distribution <inline-formula id="ieqn-513"><mml:math id="mml-ieqn-513"><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> using the principle of maximum likelihood (see Section <xref ref-type="sec" rid="s5_1_2">5.1.2</xref>, Eq. (<xref ref-type="disp-formula" rid="eqn-65">65</xref>)):</p>
<p><disp-formula id="eqn-80"><label>(80)</label><mml:math id="mml-eqn-80" display="block"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>y</mml:mi></mml:mstyle><mml:mo>&#x007E;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>p</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mi>log</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x007C;</mml:mo><mml:mstyle  mathvariant="bold-italic" mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>;</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mo>&#x1D5C6;</mml:mo></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x1D5C6;</mml:mo></mml:mrow></mml:munderover><mml:mrow><mml:mrow><mml:mo>{</mml:mo> 
<mml:mrow><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mi>log</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x007C;</mml:mo><mml:msup><mml:mstyle  mathvariant="bold-italic" mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x007C;</mml:mo><mml:msup><mml:mstyle  mathvariant="bold-italic" mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow> 
<mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
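<p>The second line of Eq. (<xref ref-type="disp-formula" rid="eqn-80">80</xref>) can be illustrated with a minimal sketch in Python, given here only as an aid to the reader. The function name <monospace>binary_cross_entropy</monospace> and the toy labels and probabilities are assumptions made for illustration; the probabilities stand in for the network outputs and are assumed to lie strictly between 0 and 1.</p>
<preformat>
```python
import math

def binary_cross_entropy(labels, probs):
    """Average negative log-likelihood of Bernoulli labels, as in Eq. (80).

    labels: observed classes y in {0, 1}
    probs:  model estimates of p(y = 1 | x), assumed strictly inside (0, 1)
    """
    total = 0.0
    for y, p in zip(labels, probs):
        # Only one of the two terms is active for each example.
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Hypothetical toy data: three examples with labels and predicted probabilities.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
loss = binary_cross_entropy(labels, probs)
```
</preformat>
<p>In this sketch the cost decreases as the predicted probabilities move toward the observed labels, which is the behavior that maximum likelihood asks for.</p>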
<p>The output of the neural network is supposed to represent the probability <inline-formula id="ieqn-514"><mml:math id="mml-ieqn-514"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e., a real-valued number in the interval <inline-formula id="ieqn-515"><mml:math id="mml-ieqn-515"><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. A linear output layer <inline-formula id="ieqn-516"><mml:math id="mml-ieqn-516"><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> does not meet this constraint in general. To squash the output of the linear layer into the range of <inline-formula id="ieqn-517"><mml:math id="mml-ieqn-517"><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, the logistic sigmoid function <inline-formula id="ieqn-518"><mml:math id="mml-ieqn-518"><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow></mml:math></inline-formula> (see Figure <xref ref-type="fig" rid="fig-30">30</xref>) can be added to the linear output unit to render <inline-formula id="ieqn-519"><mml:math id="mml-ieqn-519"><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> a probability</p>
<p><disp-formula id="eqn-81"><label>(81)</label><mml:math id="mml-eqn-81" display="block"><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mi>a</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant='fraktur'>s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
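<p>As a hedged illustration of Eq. (<xref ref-type="disp-formula" rid="eqn-81">81</xref>), the following Python sketch evaluates a single linear output unit and squashes it with the logistic sigmoid. The numerical values of the weights, the previous-layer output, and the bias are hypothetical, and the branch on the sign of the argument is one common way to avoid overflow in the exponential, not a prescription from the references.</p>
<preformat>
```python
import math

def sigmoid(z):
    """Logistic sigmoid s(z) = 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # safe here because z is negative
    return e / (1.0 + e)

# Hypothetical last-layer quantities: weights W, previous-layer output y_prev, bias b.
W = [0.5, -1.0, 2.0]
y_prev = [1.0, 2.0, 0.5]
b = 0.25

# Linear unit z^(L): an unbounded real number.
z = sum(w * y for w, y in zip(W, y_prev)) + b
# Sigmoid output y^(L): a value strictly between 0 and 1.
y_tilde = sigmoid(z)
```
</preformat>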
<p>When a classification problem has more than two categories, a neural network is trained to estimate the probability distribution over the discrete number (<inline-formula id="ieqn-520"><mml:math id="mml-ieqn-520"><mml:mi>k</mml:mi><mml:mo>&gt;</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>) of classes. Such a distribution is referred to as a <italic>multinoulli</italic> or <italic>categorical</italic> distribution, which is parameterized by the conditional probabilities <inline-formula id="ieqn-521"><mml:math id="mml-ieqn-521"><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> of an input <inline-formula id="ieqn-522"><mml:math id="mml-ieqn-522"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> belonging to the <inline-formula id="ieqn-523"><mml:math id="mml-ieqn-523"><mml:mi>i</mml:mi></mml:math></inline-formula>-th category. 
The output of the neural network accordingly is a <inline-formula id="ieqn-524"><mml:math id="mml-ieqn-524"><mml:mi>k</mml:mi></mml:math></inline-formula>-dimensional vector <inline-formula id="ieqn-525"><mml:math id="mml-ieqn-525"><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, where
<inline-formula id="ieqn-526"><mml:math id="mml-ieqn-526"><mml:mrow><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. In addition to the requirement of each component <inline-formula id="ieqn-527"><mml:math id="mml-ieqn-527"><mml:mrow><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> being in the range <inline-formula id="ieqn-528"><mml:math id="mml-ieqn-528"><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, we must also guarantee that all components sum up to 1 to satisfy the definition of a probability distribution.</p>
<p>For this purpose, the idea of <italic>exponentiation and normalization</italic> is used, which can be expressed as a change of variable in the logistic sigmoid function <inline-formula id="ieqn-529"><mml:math id="mml-ieqn-529"><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow></mml:math></inline-formula> (Figure <xref ref-type="fig" rid="fig-30">30</xref>, Section <xref ref-type="sec" rid="s5_3_1">5.3.1</xref>), as in the following example [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 177:</p>
<p><disp-formula id="eqn-82"><label>(82)</label><mml:math id="mml-eqn-82" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>constant</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-83"><label>(83)</label><mml:math id="mml-eqn-83" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and is then generalized to vector-valued outputs; see also Figure <xref ref-type="fig" rid="fig-46">46</xref>.</p>
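<p>Equations (<xref ref-type="disp-formula" rid="eqn-82">82</xref>) and (<xref ref-type="disp-formula" rid="eqn-83">83</xref>) can be checked numerically with the short Python sketch below, offered as an illustration under stated assumptions. The value of the constant <monospace>z</monospace> is arbitrary, and the pair of scores (<monospace>z</monospace>, 0) used in the last step is an assumed way of writing the two-class case as exponentiation followed by normalization.</p>
<preformat>
```python
import math

def sigmoid(z):
    """Logistic sigmoid s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def p(y, z):
    """Probability of the binary label y in {0, 1}, as in Eq. (82)."""
    return sigmoid((2 * y - 1) * z)

z = 0.7  # arbitrary constant chosen for illustration

# Eq. (83): the two probabilities s(-z) and s(z) sum to one.
total = p(0, z) + p(1, z)

# Exponentiate the two scores (z, 0) and normalize; this reproduces p(1).
p1_normalized = math.exp(z) / (math.exp(z) + math.exp(0.0))
```
</preformat>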
<p>The <italic>softmax function</italic> converts the vector formed by a linear unit <inline-formula id="ieqn-530"><mml:math id="mml-ieqn-530"><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msup></mml:math></inline-formula> into the vector of probabilities <inline-formula id="ieqn-531"><mml:math id="mml-ieqn-531"><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> by means of</p>
<p><disp-formula id="eqn-84"><label>(84)</label><mml:math id="mml-eqn-84" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and is a smoothed version of the max function [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 198.<xref ref-type="fn" rid="fn96"><sup>96</sup></xref><fn id="fn96"><label>96</label><p>See also [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 179 and p. 78, where the softmax function is used to stabilize against the underflow and overflow problem in numerical computation.</p></fn></p>
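<p>A minimal Python sketch of Eq. (<xref ref-type="disp-formula" rid="eqn-84">84</xref>) is given below for illustration. Subtracting the largest component before exponentiation is a widely used device against overflow; it does not change the result, since the common factor cancels between numerator and denominator. The input values are hypothetical.</p>
<preformat>
```python
import math

def softmax(z):
    """Softmax of Eq. (84).

    Subtracting max(z) leaves the result unchanged but keeps exp()
    from overflowing for large components.
    """
    shift = max(z)
    exps = [math.exp(z_i - shift) for z_i in z]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical output of the linear unit z^(L) for k = 3 classes.
z = [2.0, 1.0, 0.1]
y_tilde = softmax(z)

# Large components would overflow a naive exp(); the shifted form does not.
big = softmax([1000.0, 1001.0, 1002.0])
```
</preformat>
<p>Each component of the result lies in the unit interval and the components sum to one, so the output can be read as a probability distribution over the classes.</p>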
<fig id="fig-46">
<label>Figure 46</label>
<caption><title><italic>Softmax function for two classes, logistic sigmoid</italic> (Section <xref ref-type="sec" rid="s5_1_3">5.1.3</xref>, <xref ref-type="sec" rid="s5_3_1">5.3.1</xref>): <inline-formula id="ieqn-3118"><mml:math id="mml-ieqn-3118"><mml:mrow><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-3119"><mml:math id="mml-ieqn-3119"><mml:mrow><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:math></inline-formula>, such that <inline-formula id="ieqn-3120"><mml:math id="mml-ieqn-3120"><mml:mi mathvariant="fraktur">s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="fraktur">s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. See also Figure <xref ref-type="fig" rid="fig-30">30</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-46.tif"/>
</fig>
<statement id="st5_3"><title><xref ref-type="statement" rid="st5_3">Remark 5.3</xref>.</title>
<p><italic>Softmax function from Bayes&#x2019; theorem</italic>. For a classification with multiple classes <inline-formula id="ieqn-532"><mml:math id="mml-ieqn-532"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, particularized to the case of two classes with <inline-formula id="ieqn-533"><mml:math id="mml-ieqn-533"><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>, the probability for class <inline-formula id="ieqn-534"><mml:math id="mml-ieqn-534"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>, given the input column matrix <inline-formula id="ieqn-535"><mml:math id="mml-ieqn-535"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:math></inline-formula>, is obtained from Bayes&#x2019; theorem<xref ref-type="fn" rid="fn97"><sup>97</sup></xref><fn id="fn97"><label>97</label><p>Since the probability of <inline-formula id="ieqn-3103"><mml:math id="mml-ieqn-3103"><mml:mi>x</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-3104"><mml:math id="mml-ieqn-3104"><mml:mi>y</mml:mi></mml:math></inline-formula> is <inline-formula id="ieqn-3105"><mml:math id="mml-ieqn-3105"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and since <inline-formula 
id="ieqn-3106"><mml:math id="mml-ieqn-3106"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (which is the product rule), where <inline-formula id="ieqn-3107"><mml:math id="mml-ieqn-3107"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the probability of <inline-formula id="ieqn-3108"><mml:math id="mml-ieqn-3108"><mml:mi>x</mml:mi></mml:math></inline-formula> given <inline-formula id="ieqn-3109"><mml:math id="mml-ieqn-3109"><mml:mi>y</mml:mi></mml:math></inline-formula>, we have <inline-formula id="ieqn-3110"><mml:math id="mml-ieqn-3110"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and thus <inline-formula id="ieqn-3111"><mml:math id="mml-ieqn-3111"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>x</mml:mi><mml:mo 
stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The sum rule is <inline-formula id="ieqn-3112"><mml:math id="mml-ieqn-3112"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>y</mml:mi></mml:munder><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. See, e.g., [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 15. The right-hand side of the second equation in Eq. 
(<xref ref-type="disp-formula" rid="eqn-85">85</xref>)<inline-formula id="ieqn-3113"><mml:math id="mml-ieqn-3113"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> makes common sense in terms of the predator-prey problem, in which <inline-formula id="ieqn-3114"><mml:math id="mml-ieqn-3114"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> would be the percentage of predators in the total predator-prey population, and <inline-formula id="ieqn-3115"><mml:math id="mml-ieqn-3115"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the percentage of prey; as the self-proclaimed &#x201C;best mathematician of France&#x201D; Laplace said, &#x201C;probability theory is nothing but common sense reduced to calculation&#x201D; [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 24.</p></fn> as follows ([<xref ref-type="bibr" rid="ref-130">130</xref>], p. 197):</p>
<p><disp-formula id="eqn-85"><label>(85)</label><mml:math id="mml-eqn-85" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi 
mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-86"><label>(86)</label><mml:math id="mml-eqn-86" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>z</mml:mi><mml:mo>:=</mml:mo><mml:mi>ln</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" 
/><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the product rule was applied to the numerator of Eq. (<xref ref-type="disp-formula" rid="eqn-85">85</xref>)<inline-formula id="ieqn-536"><mml:math id="mml-ieqn-536"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, the sum rule to the denominator, and <inline-formula id="ieqn-537"><mml:math id="mml-ieqn-537"><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow></mml:math></inline-formula> is the logistic sigmoid. Likewise,</p>
<p><disp-formula id="eqn-87"><label>(87)</label><mml:math id="mml-eqn-87" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-88"><label>(88)</label><mml:math id="mml-eqn-88" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>as in Eq. (<xref ref-type="disp-formula" rid="eqn-83">83</xref>), and <inline-formula id="ieqn-538"><mml:math id="mml-ieqn-538"><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow></mml:math></inline-formula> is also called a normalized exponential or softmax function for <inline-formula id="ieqn-539"><mml:math id="mml-ieqn-539"><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>. Using the same procedure, for <inline-formula id="ieqn-540"><mml:math id="mml-ieqn-540"><mml:mi>K</mml:mi><mml:mo>&gt;</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>, the softmax function (version 1) can be written as<xref ref-type="fn" rid="fn98"><sup>98</sup></xref><fn id="fn98"><label>98</label><p>See also [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 115, where version 1 of the softmax function, i.e., <inline-formula id="ieqn-3116"><mml:math id="mml-ieqn-3116"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B7;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>&#x03B7;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-3117"><mml:math id="mml-ieqn-3117"><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x2211;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2264;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></inline-formula>, has &#x201C;1&#x201D; as a summand in the denominator, similar to Eq. (<xref ref-type="disp-formula" rid="eqn-89">89</xref>), whereas version 2 in [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 198, does not, similar to Eq. (<xref ref-type="disp-formula" rid="eqn-90">90</xref>), and is the same as Eq. (<xref ref-type="disp-formula" rid="eqn-84">84</xref>).</p></fn>
<disp-formula id="eqn-89"><label>(89)</label><mml:math id="mml-eqn-89" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>:=</mml:mo><mml:mi>ln</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mtext>.</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
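<p>As a numerical illustration of Eqs. (<xref ref-type="disp-formula" rid="eqn-85">85</xref>)-(<xref ref-type="disp-formula" rid="eqn-89">89</xref>), the following minimal Python sketch evaluates the logistic sigmoid for two classes and version 1 of the softmax function for three classes. The joint probabilities and the function names <monospace>sigmoid</monospace> and <monospace>softmax_v1</monospace> are hypothetical, chosen only for this illustration.</p>
<preformat>
```python
import math

def sigmoid(z):
    """Logistic sigmoid s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax_v1(joint, i):
    """Eq. (89): p(C_i | x) from the log-ratios z_ij = ln[p(x, C_i) / p(x, C_j)]."""
    total = 1.0
    for j, p_j in enumerate(joint):
        if j != i:
            z_ij = math.log(joint[i] / p_j)
            total += math.exp(-z_ij)
    return 1.0 / total

# Hypothetical joint probabilities p(x, C_k) for K = 3 classes at a fixed x.
joint = [0.10, 0.25, 0.05]

# Binary case, using the first two classes only: Eqs. (85)-(88).
z = math.log(joint[0] / joint[1])
p1_binary = sigmoid(z)
p2_binary = sigmoid(-z)

# Multi-class case: Eq. (89) compared with direct normalization of the joints.
posterior_v1 = [softmax_v1(joint, i) for i in range(len(joint))]
posterior_direct = [p / sum(joint) for p in joint]
```
</preformat>
<p>In this sketch, <monospace>p1_binary</monospace> and <monospace>p2_binary</monospace> sum to one, as in Eq. (<xref ref-type="disp-formula" rid="eqn-88">88</xref>), and <monospace>posterior_v1</monospace> agrees with <monospace>posterior_direct</monospace> up to round-off.</p>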
<p>Using a different definition, the softmax function (version 2) can be written as</p>
<p><disp-formula id="eqn-90"><label>(90)</label><mml:math id="mml-eqn-90" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>:=</mml:mo><mml:mi>ln</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is the same as Eq. (<xref ref-type="disp-formula" rid="eqn-84">84</xref>).&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement></sec></sec>
<sec id="s5_2"><label>5.2</label>
<title>Gradient of cost function by backpropagation</title>
<p>The gradient of a cost function <inline-formula id="ieqn-541"><mml:math id="mml-ieqn-541"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> with respect to the parameters <inline-formula id="ieqn-542"><mml:math id="mml-ieqn-542"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> is obtained using the chain rule of differentiation, and backpropagation is an efficient way to evaluate the chain rule. In forward propagation, the computation (or function composition) moves from the first layer <inline-formula id="ieqn-543"><mml:math id="mml-ieqn-543"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to the last layer <inline-formula id="ieqn-544"><mml:math id="mml-ieqn-544"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>; in backpropagation, the computation moves in reverse order, from the last layer <inline-formula id="ieqn-545"><mml:math id="mml-ieqn-545"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to the first layer <inline-formula id="ieqn-546"><mml:math id="mml-ieqn-546"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
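<p>The two sweep directions can be illustrated with a minimal Python sketch on a chain of scalar layers. The tanh activation, the scalar weights, the squared-error cost, and all function names are assumptions made for this illustration only and are not taken from the derivation that follows.</p>
<preformat>
```python
import math

def forward(w, x):
    """Forward sweep, layer 1 to layer L, storing every layer output."""
    ys = [x]
    for w_l in w:
        ys.append(math.tanh(w_l * ys[-1]))
    return ys

def cost(w, x, target):
    """Half squared error of the last layer output (assumed cost)."""
    return 0.5 * (forward(w, x)[-1] - target) ** 2

def backward(w, ys, target):
    """Backward sweep, layer L to layer 1, reusing the stored outputs."""
    grads = [0.0] * len(w)
    dJ_dy = ys[-1] - target                     # derivative of cost w.r.t. last output
    for l in reversed(range(len(w))):
        dJ_dz = dJ_dy * (1.0 - ys[l + 1] ** 2)  # tanh'(z) = 1 - tanh(z)^2
        grads[l] = dJ_dz * ys[l]                # since z = w_l * y_prev
        dJ_dy = dJ_dz * w[l]                    # sensitivity passed to previous layer
    return grads

w = [0.7, -1.3, 0.4]   # hypothetical scalar weights, one per layer
x, target = 0.5, 0.2
grads = backward(w, forward(w, x), target)

# Central finite differences as an independent check of the chain rule.
eps = 1e-6
fd = []
for l in range(len(w)):
    w_plus = list(w)
    w_plus[l] += eps
    w_minus = list(w)
    w_minus[l] -= eps
    fd.append((cost(w_plus, x, target) - cost(w_minus, x, target)) / (2 * eps))
```
</preformat>
<p>The backward sweep needs one pass over the layers to obtain all gradients, whereas the finite-difference check needs two cost evaluations per parameter; the two results agree up to the finite-difference error.</p>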
<statement id="st5_4"><title><xref ref-type="statement" rid="st5_4">Remark 5.4</xref>.</title>
<p>We focus on developing backpropagation for fully-connected networks, for which an explicit derivation was not provided in [<xref ref-type="bibr" rid="ref-78">78</xref>], but which would help clarify the pseudocode.<xref ref-type="fn" rid="fn99"><sup>99</sup></xref><fn id="fn99"><label>99</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], Section <xref ref-type="sec" rid="s6_5_4">6.5.4</xref>, p. 206, Algorithm 6.4.</p></fn> The approach in [<xref ref-type="bibr" rid="ref-78">78</xref>] was based on computational graphs, which may not be familiar to first-time learners from computational mechanics, but is more general in that it applies to networks with more general architectures, such as those with skip connections, which require keeping track of parent and child processing units to construct the backpropagation path. See also Appendix <xref ref-type="app" rid="app1">1</xref>, where the backprop Algorithm <xref ref-type="fig" rid="fig-159">1</xref> below is rewritten in a different form to explain the equivalent Algorithm 6.4 in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 206.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>It is convenient to recall here some equations developed earlier (keeping the same equation numbers) for the computation of the gradient <inline-formula id="ieqn-547"><mml:math id="mml-ieqn-547"><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi><mml:mo>/</mml:mo><mml:mo>&#x2202;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> of the cost function <inline-formula id="ieqn-548"><mml:math id="mml-ieqn-548"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> with respect to the parameters <inline-formula id="ieqn-549"><mml:math id="mml-ieqn-549"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in layer <inline-formula id="ieqn-550"><mml:math id="mml-ieqn-550"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, going backward from the last layer <inline-formula id="ieqn-551"><mml:math id="mml-ieqn-551"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>.</p>
<p><list list-type="bullet">
<list-item><p>Cost function <inline-formula id="ieqn-553"><mml:math id="mml-ieqn-553"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p></list-item>
</list></p>
<p><disp-formula id="eqn-62a"><label>(62)</label><mml:math id="mml-eqn-62a" display="block"><mml:mrow><mml:mi>J</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>&#x03B8;</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mtext>MSE</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x1D5C6;</mml:mo></mml:mrow></mml:mfrac><mml:mo>&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x1D5C6;</mml:mo></mml:mrow></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x1D5C6;</mml:mo></mml:mrow></mml:munderover><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mstyle><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><list list-type="bullet">
<list-item><p>Inputs <inline-formula id="ieqn-555"><mml:math id="mml-ieqn-555"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> with <inline-formula id="ieqn-556"><mml:math id="mml-ieqn-556"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula> and predicted outputs <inline-formula id="ieqn-557"><mml:math id="mml-ieqn-557"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> with <inline-formula id="ieqn-558"><mml:math id="mml-ieqn-558"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>:</p></list-item>
</list></p>
<p><disp-formula id="eqn-19a"><label>(19)</label><mml:math id="mml-eqn-19a" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mtext>&#xA0;(inputs)&#xA0;</mml:mtext></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd 
/><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mtext>&#xA0;(predicted outputs)&#xA0;</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-47">
<label>Figure 47</label>
<caption><title><italic>Backpropagation building block, typical layer <inline-formula id="ieqn-612"><mml:math id="mml-ieqn-612"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula></italic> (Section <xref ref-type="sec" rid="s5_2">5.2</xref>, Algorithm <xref ref-type="fig" rid="fig-159">1</xref>, Appendix <xref ref-type="app" rid="app1">1</xref>). The forward propagation path is shown in blue, and the backpropagation path in red. The parameters <inline-formula id="ieqn-613"><mml:math id="mml-ieqn-613"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in layer <inline-formula id="ieqn-614"><mml:math id="mml-ieqn-614"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> are updated by a gradient descent algorithm as soon as the gradient <inline-formula id="ieqn-615"><mml:math id="mml-ieqn-615"><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is available.
The row matrix <inline-formula id="ieqn-616"><mml:math id="mml-ieqn-616"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">r</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-104">104</xref>) is computed once, used to evaluate both the gradient <inline-formula id="ieqn-617"><mml:math id="mml-ieqn-617"><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-105">105</xref>) and the gradient <inline-formula id="ieqn-618"><mml:math id="mml-ieqn-618"><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-106">106</xref>), and then discarded to free up memory. See the pseudocode in Algorithm <xref ref-type="fig" rid="fig-159">1</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-47.tif"/>
</fig>
<p><list list-type="bullet">
<list-item><p>Weighted sum of inputs and biases <inline-formula id="ieqn-560"><mml:math id="mml-ieqn-560"><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>:</p></list-item>
</list></p>
<p><disp-formula id="eqn-26a"><label>(26)</label><mml:math id="mml-eqn-26a" display="block"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mtext>&#x2009;such&#x2009;that&#x2009;</mml:mtext><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>b</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mtext>&#x2009;,&#x2009;&#x2009;for&#x2009;</mml:mtext><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><list list-type="bullet">
<list-item><p>Network parameters <inline-formula id="ieqn-562"><mml:math id="mml-ieqn-562"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> and layer parameters <inline-formula id="ieqn-563"><mml:math id="mml-ieqn-563"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>:</p></list-item>
</list></p>
<p><disp-formula id="eqn-30a"><label>(30)</label><mml:math id="mml-eqn-30a" display="block"><mml:mrow><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>&#x0398;</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x22EE;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x007C;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-31a"><label>(31)</label><mml:math id="mml-eqn-31a" display="block"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow> <mml:mo>}</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>&#x0398;</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>&#x0398;</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>&#x0398;</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow> <mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;such&#x00A0;that&#x00A0;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo 
stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>&#x0398;</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula></p>
<p><list list-type="bullet">
<list-item><p>Expanded layer outputs <inline-formula id="ieqn-565"><mml:math id="mml-ieqn-565"><mml:mrow><mml:msup><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>:</p></list-item>
</list></p>
<p><disp-formula id="eqn-32a"><label>(32)</label><mml:math id="mml-eqn-32a" display="block"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2261;</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x007C;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>:</mml:mo><mml:msup><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><list list-type="bullet">
<list-item><p>Activation function <inline-formula id="ieqn-567"><mml:math id="mml-ieqn-567"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p></list-item>
</list></p>
<p><disp-formula id="eqn-35a"><label>(35)</label><mml:math id="mml-eqn-35a" display="block"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x2009;such&#x2009;that&#x2009;</mml:mtext><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
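<p>As a minimal sketch of the layer map in Eqs. (<xref ref-type="disp-formula" rid="eqn-32a">32</xref>) and (<xref ref-type="disp-formula" rid="eqn-35a">35</xref>), the following plain-Python fragment stores the expanded parameter matrix as a list of rows and appends the constant 1 to the previous layer output. The helper names <monospace>expand</monospace> and <monospace>layer_forward</monospace>, the choice of tanh as activation, and the numerical values are illustrative assumptions, not taken from the text.</p>
<preformat>
```python
import math

def expand(y):
    """Append the constant 1 to a layer output, giving the expanded output y_bar."""
    return list(y) + [1.0]

def layer_forward(theta, y_prev, a=math.tanh):
    """Return (z, y) for one layer: z = theta @ y_bar, y = a(z) elementwise.

    theta is the m(l) x [m(l-1) + 1] matrix [W | b], stored as a list of rows.
    The activation a defaults to tanh (an assumption made for this sketch).
    """
    y_bar = expand(y_prev)
    z = [sum(t * v for t, v in zip(row, y_bar)) for row in theta]
    y = [a(z_i) for z_i in z]
    return z, y

# Hypothetical layer with 2 inputs and 2 outputs:
# W = [[1, 2], [0, -1]], b = [0.5, 0.25], so theta = [W | b].
theta = [[1.0, 2.0, 0.5],
         [0.0, -1.0, 0.25]]
z, y = layer_forward(theta, [1.0, -1.0])
print(z)  # pre-activations, equal to W y + b
print(y)  # layer outputs a(z)
```
</preformat>
<p>Storing the bias as the last column of the parameter matrix means that a single matrix-vector product covers both the weights and the bias, which is the reason for introducing the expanded output.</p>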
<p>The gradient of the cost function <inline-formula id="ieqn-568"><mml:math id="mml-ieqn-568"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> with respect to the parameters <inline-formula id="ieqn-569"><mml:math id="mml-ieqn-569"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in layer <inline-formula id="ieqn-570"><mml:math id="mml-ieqn-570"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, for <inline-formula id="ieqn-571"><mml:math id="mml-ieqn-571"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, is simply:</p>
<p><disp-formula id="eqn-91"><label>(91)</label><mml:math id="mml-eqn-91" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo stretchy="false">&#x21D4;</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-92"><label>(92)</label><mml:math id="mml-eqn-92" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" 
/><mml:mspace width="1em" /><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#xA0;(row)</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-48">
<label>Figure 48</label>
<caption><title><italic>Backpropagation in fully-connected network</italic> (Sections <xref ref-type="sec" rid="s5_2">5.2</xref>, <xref ref-type="sec" rid="s5_3">5.3</xref>, Algorithm <xref ref-type="fig" rid="fig-159">1</xref>, Appendix <xref ref-type="app" rid="app1">1</xref>). Starting from the predicted output <inline-formula id="ieqn-620"><mml:math id="mml-ieqn-620"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in the last layer <inline-formula id="ieqn-621"><mml:math id="mml-ieqn-621"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> at the end of any forward propagation (blue arrows), and going backward (red arrows) to the first layer with <inline-formula id="ieqn-622"><mml:math id="mml-ieqn-622"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, and along the way at layer <inline-formula id="ieqn-623"><mml:math id="mml-ieqn-623"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, compute the gradient of the cost function <inline-formula id="ieqn-624"><mml:math id="mml-ieqn-624"><mml:mi>J</mml:mi></mml:math></inline-formula> relative to the parameters <inline-formula id="ieqn-625"><mml:math id="mml-ieqn-625"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> to update those parameters in a gradient descent, then compute the gradient of 
<inline-formula id="ieqn-626"><mml:math id="mml-ieqn-626"><mml:mi>J</mml:mi></mml:math></inline-formula> relative to the outputs <inline-formula id="ieqn-627"><mml:math id="mml-ieqn-627"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of the lower-level layer <inline-formula id="ieqn-628"><mml:math id="mml-ieqn-628"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to continue the backpropagation. See pseudocode in Algorithm <xref ref-type="fig" rid="fig-159">1</xref>. For a particular example of the above general case, see Figure <xref ref-type="fig" rid="fig-51">51</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-48.tif"/>
</fig>
<p>The above equations are valid for the last layer <inline-formula id="ieqn-580"><mml:math id="mml-ieqn-580"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mi>L</mml:mi></mml:math></inline-formula>, since the predicted output <inline-formula id="ieqn-581"><mml:math id="mml-ieqn-581"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> is the same as the output of the last layer <inline-formula id="ieqn-582"><mml:math id="mml-ieqn-582"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-583"><mml:math id="mml-ieqn-583"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> by Eq. (<xref ref-type="disp-formula" rid="eqn-19">19</xref>). 
Similarly, these equations are also valid for the first layer <inline-formula id="ieqn-584"><mml:math id="mml-ieqn-584"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> since the input for layer <inline-formula id="ieqn-585"><mml:math id="mml-ieqn-585"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is <inline-formula id="ieqn-586"><mml:math id="mml-ieqn-586"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. Using Eq. (<xref ref-type="disp-formula" rid="eqn-35">35</xref>), we obtain (no sum on <inline-formula id="ieqn-587"><mml:math id="mml-ieqn-587"><mml:mi>k</mml:mi></mml:math></inline-formula>)</p>
<p><disp-formula id="eqn-93"><label>(93)</label><mml:math id="mml-eqn-93" display="block"><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:munder><mml:mstyle mathsize='140%' displaystyle='true'><mml:mo>&#x2211;</mml:mo></mml:mstyle><mml:mi>p</mml:mi></mml:munder><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo 
stretchy='false'>)</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:msubsup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:munder><mml:mstyle mathsize='140%' displaystyle='true'><mml:mo>&#x2211;</mml:mo></mml:mstyle><mml:mi>p</mml:mi></mml:munder><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo 
stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></disp-formula></p>
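<p>The closed-form derivative in Eq. (<xref ref-type="disp-formula" rid="eqn-93">93</xref>) can be checked numerically. The sketch below, which assumes a tanh activation and uses hypothetical helper names and values, compares the analytic expression against a central finite difference for every index triple, and then contracts the derivative with an arbitrary row of cost derivatives following the chain rule of Eq. (<xref ref-type="disp-formula" rid="eqn-91">91</xref>).</p>
<preformat>
```python
import math

def forward(theta, y_bar):
    """Return (z, y) with z = theta @ y_bar and y_k = tanh(z_k).

    y_bar is the expanded output, i.e., it already ends with the constant 1.
    """
    z = [sum(t * v for t, v in zip(row, y_bar)) for row in theta]
    return z, [math.tanh(z_k) for z_k in z]

def dy_dtheta(theta, y_bar, k, i, j):
    """Analytic derivative a'(z_k) * delta_ki * y_bar_j, with a = tanh (assumed)."""
    z, _ = forward(theta, y_bar)
    a_prime = 1.0 - math.tanh(z[k]) ** 2
    return a_prime * (1.0 if k == i else 0.0) * y_bar[j]

def dy_dtheta_fd(theta, y_bar, k, i, j, h=1e-6):
    """Central finite-difference estimate of the same derivative."""
    plus = [row[:] for row in theta]
    minus = [row[:] for row in theta]
    plus[i][j] += h
    minus[i][j] -= h
    return (forward(plus, y_bar)[1][k] - forward(minus, y_bar)[1][k]) / (2.0 * h)

# Hypothetical 2 x 3 parameter matrix and expanded input (illustrative values).
theta = [[1.0, 2.0, 0.5], [0.0, -1.0, 0.25]]
y_bar = [1.0, -1.0, 1.0]

# Largest disagreement between analytic and finite-difference derivatives.
max_err = max(
    abs(dy_dtheta(theta, y_bar, k, i, j) - dy_dtheta_fd(theta, y_bar, k, i, j))
    for k in range(2) for i in range(2) for j in range(3)
)
print(max_err)

# Contract with a hypothetical row dJ/dy, as in the chain rule of Eq. (91).
dJ_dy = [0.3, -0.7]
grad = [[sum(dJ_dy[k] * dy_dtheta(theta, y_bar, k, i, j) for k in range(2))
         for j in range(3)] for i in range(2)]
print(grad)  # same shape as theta: 2 rows, 3 columns
```
</preformat>
<p>Because of the Kronecker delta, only the term with <italic>k</italic> = <italic>i</italic> survives the contraction, so each row of the resulting gradient is a scalar multiple of the expanded output of the previous layer.</p>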
<p>Using Eq. (<xref ref-type="disp-formula" rid="eqn-93">93</xref>) in Eq. (<xref ref-type="disp-formula" rid="eqn-91">91</xref>) leads to the expressions for the gradient, both in component form (left) and in matrix form (right):</p>
<p><disp-formula id="eqn-94"><label>(94)</label><mml:math id="mml-eqn-94" display="block"><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:msubsup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mtext>&#x2009;</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mtext>no&#x2009;sum&#x2009;on&#x00A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x21D4;</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2299;</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-588"><mml:math id="mml-ieqn-588"><mml:mtext>&#x2299;</mml:mtext></mml:math></inline-formula> denotes elementwise multiplication, also known as the Hadamard product, defined as follows:</p>
<p><disp-formula id="eqn-95"><label>(95)</label><mml:math id="mml-eqn-95" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>q</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2299;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>q</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>q</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>q</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:msub><mml:mi>q</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi 
mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mrow><mml:mtext>&#xA0;(no sum on&#xA0;</mml:mtext><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mtext>)</mml:mtext></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-159">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-159.tif"/>
</fig>
<p>and</p>
<p><disp-formula id="eqn-96"><label>(96)</label><mml:math id="mml-eqn-96" display="block"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext>row</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext>column</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x21D2;</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi 
mathvariant="bold-italic">z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext>column</mml:mtext><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-97"><label>(97)</label><mml:math id="mml-eqn-97" display="block"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2299;</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext>column</mml:mtext><mml:mo>,</mml:mo><mml:mtext>no&#x2009;sum&#x2009;on&#x2009;</mml:mtext><mml:mi>i</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;and</mml:mtext><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow> 
<mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext>column</mml:mtext><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-98"><label>(98)</label><mml:math id="mml-eqn-98" display="block"><mml:mrow><mml:mo>&#x21D2;</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2299;</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo><mml:mtext>&#x00A0;</mml:mtext></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>which then agrees with the matrix dimension in the first expression for <inline-formula id="ieqn-589"><mml:math id="mml-ieqn-589"><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-92">92</xref>). For the last layer <inline-formula id="ieqn-590"><mml:math id="mml-ieqn-590"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mi>L</mml:mi></mml:math></inline-formula>, all terms on the right-hand side of Eq. (<xref ref-type="disp-formula" rid="eqn-98">98</xref>) are available for the computation of the gradient <inline-formula id="ieqn-591"><mml:math id="mml-ieqn-591"><mml:mtext>&#x2202;</mml:mtext><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> since</p>
<p><disp-formula id="eqn-99"><label>(99)</label><mml:math id="mml-eqn-99" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#xA0;(row)</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-592"><mml:math id="mml-ieqn-592"><mml:mtext>&#x1D5C6;</mml:mtext></mml:math></inline-formula> is the number of examples and <inline-formula id="ieqn-593"><mml:math id="mml-ieqn-593"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> is the width of layer <inline-formula id="ieqn-594"><mml:math id="mml-ieqn-594"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, is the mean error that follows from the expression of the cost function in Eq. (<xref ref-type="disp-formula" rid="eqn-64">64</xref>), while <inline-formula id="ieqn-595"><mml:math id="mml-ieqn-595"><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-596"><mml:math id="mml-ieqn-596"><mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> were already computed in the forward propagation. 
To compute the gradient of the cost function with respect to the parameters <inline-formula id="ieqn-597"><mml:math id="mml-ieqn-597"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in layer <inline-formula id="ieqn-598"><mml:math id="mml-ieqn-598"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, we need the derivative <inline-formula id="ieqn-599"><mml:math id="mml-ieqn-599"><mml:mtext>&#x2202;</mml:mtext><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, per Eq. (<xref ref-type="disp-formula" rid="eqn-98">98</xref>). 
Thus, in general, the derivative of the cost function <inline-formula id="ieqn-600"><mml:math id="mml-ieqn-600"><mml:mi>J</mml:mi></mml:math></inline-formula> with respect to the output matrix <inline-formula id="ieqn-601"><mml:math id="mml-ieqn-601"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of layer <inline-formula id="ieqn-602"><mml:math id="mml-ieqn-602"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-603"><mml:math id="mml-ieqn-603"><mml:mtext>&#x2202;</mml:mtext><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, can be expressed in terms of the previously computed derivative <inline-formula id="ieqn-604"><mml:math id="mml-ieqn-604"><mml:mtext>&#x2202;</mml:mtext><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and other quantities for layer <inline-formula id="ieqn-605"><mml:math id="mml-ieqn-605"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as follows:</p>
<p><disp-formula id="eqn-100"><label>(100)</label><mml:math id="mml-eqn-100" display="block"><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>k</mml:mi></mml:munder><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>k</mml:mi></mml:munder><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo 
stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>k</mml:mi></mml:munder><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-101"><label>(101)</label><mml:math id="mml-eqn-101" display="block"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2299;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow> 
<mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo stretchy='false'>(</mml:mo><mml:mtext>no&#x2009;sum&#x2009;on&#x2009;</mml:mtext><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-102"><label>(102)</label><mml:math id="mml-eqn-102" display="block"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2003;</mml:mtext><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext>row</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;&#x2009;</mml:mtext><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo 
stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo stretchy='false'>(</mml:mo><mml:mtext>column</mml:mtext><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-103"><label>(103)</label><mml:math id="mml-eqn-103" display="block"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2299;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo stretchy='false'>(</mml:mo><mml:mtext>no&#x2009;sum&#x2009;on&#x00A0;</mml:mtext><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2003;</mml:mtext><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo 
stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
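<p>As a minimal numerical sketch of Eqs. (<xref ref-type="disp-formula" rid="eqn-100">100</xref>)-(<xref ref-type="disp-formula" rid="eqn-103">103</xref>), the NumPy snippet below evaluates the derivative of the cost with respect to the outputs of layer (&#x2113;&#x2212;1) in two ways, once by the explicit sum over <italic>k</italic> in Eq. (<xref ref-type="disp-formula" rid="eqn-100">100</xref>) and once by the matrix product in Eq. (<xref ref-type="disp-formula" rid="eqn-101">101</xref>), and checks that the two agree. The layer widths, the tanh activation, the random values standing in for the upstream derivative, and all variable names are assumptions made only for illustration; none of them come from the text.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import numpy as np

rng = np.random.default_rng(0)
m_prev, m_cur = 4, 3                       # assumed widths m_(l-1) and m_(l)

W = rng.standard_normal((m_cur, m_prev))   # weights w_ki of layer (l)
b = rng.standard_normal((m_cur, 1))        # biases of layer (l)
y_prev = rng.standard_normal((m_prev, 1))  # column y^(l-1)
z = W @ y_prev + b                         # column z^(l)
a_prime = 1.0 - np.tanh(z) ** 2            # column a'(z^(l)), assuming a = tanh
dJ_dy = rng.standard_normal((1, m_cur))    # row dJ/dy^(l), random stand-in

# Eq. (100): explicit sum over k for each component i
dJ_dy_prev_sum = np.zeros((1, m_prev))
for i in range(m_prev):
    for k in range(m_cur):
        dJ_dy_prev_sum[0, i] += dJ_dy[0, k] * a_prime[k, 0] * W[k, i]

# Eq. (103): row times transposed column, elementwise (no sum on k)
dJ_dz = dJ_dy * a_prime.T                  # shape (1, m_cur)

# Eq. (101): row times weight matrix
dJ_dy_prev_mat = dJ_dz @ W                 # shape (1, m_prev)

assert np.allclose(dJ_dy_prev_sum, dJ_dy_prev_mat)
```
]]></code>
<p>The elementwise product of the row of cost derivatives with the transposed column of activation derivatives is the row in Eq. (<xref ref-type="disp-formula" rid="eqn-103">103</xref>), so the double loop collapses to a single row-times-matrix product.</p>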
<p>Comparing Eq. (<xref ref-type="disp-formula" rid="eqn-101">101</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-98">98</xref>), when backpropagation reaches layer <inline-formula id="ieqn-606"><mml:math id="mml-ieqn-606"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the same row matrix</p>
<p><disp-formula id="eqn-104"><label>(104)</label><mml:math id="mml-eqn-104" display="block"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2299;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>y</mml:mi></mml:mstyle><mml:mrow><mml:mo 
stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>&#x2299;</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo stretchy='false'>(</mml:mo><mml:mtext>row</mml:mtext><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>needs to be computed only once, and is then used for both the gradient of the cost <inline-formula id="ieqn-607"><mml:math id="mml-ieqn-607"><mml:mi>J</mml:mi></mml:math></inline-formula> relative to the parameters <inline-formula id="ieqn-608"><mml:math id="mml-ieqn-608"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> [see Eq. (<xref ref-type="disp-formula" rid="eqn-98">98</xref>) and Figure <xref ref-type="fig" rid="fig-47">47</xref>]</p>
<p><disp-formula id="eqn-105"><label>(105)</label><mml:math id="mml-eqn-105" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and the gradient of the cost <inline-formula id="ieqn-609"><mml:math id="mml-ieqn-609"><mml:mi>J</mml:mi></mml:math></inline-formula> relative to the outputs <inline-formula id="ieqn-610"><mml:math id="mml-ieqn-610"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of layer <inline-formula id="ieqn-611"><mml:math id="mml-ieqn-611"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> [see Eq. (<xref ref-type="disp-formula" rid="eqn-101">101</xref>) and Figure <xref ref-type="fig" rid="fig-47">47</xref>]</p>
<p><disp-formula id="eqn-106"><label>(106)</label><mml:math id="mml-eqn-106" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The block diagram for backpropagation at layer <inline-formula id="ieqn-619"><mml:math id="mml-ieqn-619"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>&#x2014;as described in Eq. (<xref ref-type="disp-formula" rid="eqn-104">104</xref>), Eq. (<xref ref-type="disp-formula" rid="eqn-105">105</xref>), Eq. (<xref ref-type="disp-formula" rid="eqn-106">106</xref>)&#x2014;is given in Figure <xref ref-type="fig" rid="fig-47">47</xref>, and for a fully-connected network in Figure <xref ref-type="fig" rid="fig-48">48</xref>, with pseudocode given in Algorithm <xref ref-type="fig" rid="fig-159">1</xref>.</p>
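<p>As a minimal sketch of the two products in Eq. (<xref ref-type="disp-formula" rid="eqn-105">105</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-106">106</xref>), the following plain-Python function performs one backpropagation step at layer (&#x2113;). It assumes that <italic>r</italic><sup>(&#x2113;)</sup> is a row vector holding the derivative of the cost relative to the pre-activations of layer (&#x2113;), that the parameter matrix stores the weights followed by one bias column, and that the augmented output is the previous-layer output with a trailing 1. The function name and the numerical values are hypothetical, chosen only for illustration.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
def backprop_layer(r, W, y_prev):
    """One backpropagation step at layer (l), mirroring Eqs. (105)-(106).

    r      : list of m_l floats, assumed to hold dJ/dz^(l) (row vector r^(l))
    W      : m_l x m_prev weight matrix, given as a list of rows
    y_prev : list of m_prev floats, the outputs y^(l-1) of the previous layer
    """
    # Augmented previous output [y^(l-1); 1]; the last entry pairs with the bias.
    y_bar = list(y_prev) + [1.0]
    # Eq. (105): outer product of r^T and y_bar^T, shape m_l x (m_prev + 1).
    grad_theta = [[r_i * y_j for y_j in y_bar] for r_i in r]
    # Eq. (106): row vector r times W, shape 1 x m_prev.
    grad_y_prev = [sum(r[i] * W[i][j] for i in range(len(r)))
                   for j in range(len(y_prev))]
    return grad_theta, grad_y_prev


# Hypothetical data: m_l = 2 neurons in layer (l), m_prev = 3 in layer (l-1).
r = [0.5, -1.0]
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
y_prev = [1.0, 0.0, -1.0]
grad_theta, grad_y_prev = backprop_layer(r, W, y_prev)
```
]]></code>
<p>Both results reuse the same vector <italic>r</italic><sup>(&#x2113;)</sup>, which is the point of computing it only once; the second result is what would be passed on to layer (&#x2113;&#x2212;1).</p>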
</sec>
<sec id="s5_3"><label>5.3</label>
<title>Vanishing and exploding gradients</title>
<p>To demonstrate the vanishing gradient problem, the network used in [<xref ref-type="bibr" rid="ref-21">21</xref>] has an input layer of 784 neurons, corresponding to the <inline-formula id="ieqn-629"><mml:math id="mml-ieqn-629"><mml:mn>28</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>28</mml:mn><mml:mo>=</mml:mo><mml:mn>784</mml:mn></mml:math></inline-formula> pixels of the input image, four hidden layers of 30 neurons each, and an output layer of 10 neurons, corresponding to the 10 possible classifications of the MNIST digits (&#x2018;0&#x2019;, &#x2018;1&#x2019;, &#x2018;2&#x2019;, ..., &#x2018;9&#x2019;). A key ingredient is the use of the sigmoid function as active function; see Figure <xref ref-type="fig" rid="fig-30">30</xref>.</p>
<p>We note immediately that the vanishing/exploding gradient problem can be resolved using the rectified linear function (ReLU, Figure <xref ref-type="fig" rid="fig-24">24</xref>) as active function in combination with &#x201C;normalized initialization&#x201D;<xref ref-type="fn" rid="fn100"><sup>100</sup></xref><fn id="fn100"><label>100</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 295.</p></fn> and &#x201C;intermediate normalization layers&#x201D;, which are mentioned in [<xref ref-type="bibr" rid="ref-127">127</xref>], and which we will not discuss here.</p>
<p>The speed of learning of a hidden layer <inline-formula id="ieqn-630"><mml:math id="mml-ieqn-630"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-49">49</xref> is defined as the norm of the gradient <inline-formula id="ieqn-631"><mml:math id="mml-ieqn-631"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of the cost function <inline-formula id="ieqn-632"><mml:math id="mml-ieqn-632"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> with respect to the parameters <inline-formula id="ieqn-633"><mml:math id="mml-ieqn-633"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in the hidden layer <inline-formula id="ieqn-634"><mml:math id="mml-ieqn-634"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-107"><label>(107)</label><mml:math id="mml-eqn-107" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">&#x2225;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x2225;</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>&#x2225;</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
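<p>The speed of learning in Eq. (<xref ref-type="disp-formula" rid="eqn-107">107</xref>) can be evaluated as sketched below. The text does not state which norm is meant; the sketch assumes the Euclidean (Frobenius) norm of the layer gradient, stored as a list of rows. The function name and the two small gradients are hypothetical and are not taken from [<xref ref-type="bibr" rid="ref-21">21</xref>].</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import math


def speed_of_learning(grad_theta):
    """Frobenius norm of a layer gradient dJ/dtheta^(l), given as a list of rows."""
    return math.sqrt(sum(g * g for row in grad_theta for g in row))


# Hypothetical gradients of an early layer and a late layer (made-up values).
g_layer1 = [[3.0e-5, 4.0e-5]]
g_layer4 = [[3.0e-3, 4.0e-3]]

# Ratio of the two speeds of learning; a large ratio signals a vanishing
# gradient in the early layer.
ratio = speed_of_learning(g_layer4) / speed_of_learning(g_layer1)
```
]]></code>
<p>Tracking this norm for every layer after each epoch would yield curves of the kind shown in Figure <xref ref-type="fig" rid="fig-49">49</xref>.</p>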
<p>The speed of learning in each of the four layers as a function of the number of epochs<xref ref-type="fn" rid="fn101"><sup>101</sup></xref><fn id="fn101"><label>101</label><p>An epoch is completed when all examples in the dataset have been used in a training session of the optimization process. For a formal definition of &#x201C;epoch&#x201D;, see Section <xref ref-type="sec" rid="s6_3_1">6.3.1</xref> on stochastic gradient descent (SGD) and Footnote <xref ref-type="fn" rid="fn145">145</xref>.</p></fn> of training dropped quickly in fewer than 50 training epochs and then plateaued, as depicted in Figure <xref ref-type="fig" rid="fig-49">49</xref>, where the speed of learning of layer (1) was 100 times smaller than that of layer (4) after 400 training epochs.</p>
<fig id="fig-49">
<label>Figure 49</label>
<caption><title><italic>Vanishing gradient problem</italic> (Section <xref ref-type="sec" rid="s5_3">5.3</xref>). Speed of learning of earlier layers is much slower than that of later layers. Here, after 400 epochs of training, the speed of learning of Layer (1) at <inline-formula id="ieqn-646"><mml:math id="mml-ieqn-646"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>5</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> (blue line) is 100 times slower than that of Layer (4) at <inline-formula id="ieqn-647"><mml:math id="mml-ieqn-647"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> (green line); [<xref ref-type="bibr" rid="ref-21">21</xref>], Chapter 5, &#x2018;Why are deep neural networks hard to train?&#x2019; <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by-nc/3.0/deed.en_GB">(CC BY-NC 3.0)</ext-link>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-49.tif"/>
</fig>
<p>To understand the reason for the quick and significant decrease in the speed of learning, consider a network with four layers, having one scalar input <inline-formula id="ieqn-635"><mml:math id="mml-ieqn-635"><mml:mi>x</mml:mi></mml:math></inline-formula> with target scalar output <inline-formula id="ieqn-636"><mml:math id="mml-ieqn-636"><mml:mi>y</mml:mi></mml:math></inline-formula>, and predicted scalar output <inline-formula id="ieqn-637"><mml:math id="mml-ieqn-637"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula>, as shown in Figure <xref ref-type="fig" rid="fig-50">50</xref>, where each layer has one neuron.<xref ref-type="fn" rid="fn102"><sup>102</sup></xref><fn id="fn102"><label>102</label><p>See also [<xref ref-type="bibr" rid="ref-21">21</xref>].</p></fn> The cost function and its derivative are</p>
<p><disp-formula id="eqn-108"><label>(108)</label><mml:math id="mml-eqn-108" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-50">
<label>Figure 50</label>
<caption><title><italic>Neural network with four layers</italic> (Section <xref ref-type="sec" rid="s5_3">5.3</xref>), one neuron per layer, scalar input <inline-formula id="ieqn-648"><mml:math id="mml-ieqn-648"><mml:mi>x</mml:mi></mml:math></inline-formula>, target scalar output <inline-formula id="ieqn-649"><mml:math id="mml-ieqn-649"><mml:mi>y</mml:mi></mml:math></inline-formula>, cost function <inline-formula id="ieqn-650"><mml:math id="mml-ieqn-650"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-651"><mml:math id="mml-ieqn-651"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>4</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> being the predicted output, i.e., the output of layer <inline-formula id="ieqn-652"><mml:math id="mml-ieqn-652"><mml:mo stretchy="false">(</mml:mo><mml:mn>4</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, such that <inline-formula id="ieqn-653"><mml:math id="mml-ieqn-653"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo 
stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-654"><mml:math id="mml-ieqn-654"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> being the active function, <inline-formula id="ieqn-655"><mml:math id="mml-ieqn-655"><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, for <inline-formula id="ieqn-656"><mml:math id="mml-ieqn-656"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula>, and the network parameters are <inline-formula id="ieqn-657"><mml:math id="mml-ieqn-657"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>=</mml:mo><mml:mo 
stretchy="false">[</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. The detailed block diagram is in Figure <xref ref-type="fig" rid="fig-51">51</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-50.tif"/>
</fig>
<p>The neuron in layer <inline-formula id="ieqn-638"><mml:math id="mml-ieqn-638"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> accepts the scalar input <inline-formula id="ieqn-639"><mml:math id="mml-ieqn-639"><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> to produce the scalar output <inline-formula id="ieqn-640"><mml:math id="mml-ieqn-640"><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> according to</p>
<p><disp-formula id="eqn-109"><label>(109)</label><mml:math id="mml-eqn-109" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>As an example of computing the gradient, the derivative of the cost function <inline-formula id="ieqn-641"><mml:math id="mml-ieqn-641"><mml:mi>J</mml:mi></mml:math></inline-formula> with respect to the bias <inline-formula id="ieqn-642"><mml:math id="mml-ieqn-642"><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of layer <inline-formula id="ieqn-643"><mml:math id="mml-ieqn-643"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is given by</p>
<p><disp-formula id="eqn-110"><label>(110)</label><mml:math id="mml-eqn-110" display="block"><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>4</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>4</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>]</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>3</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>3</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>]</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo 
stretchy='false'>)</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>]</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The back propagation procedure to compute the gradient <inline-formula id="ieqn-644"><mml:math id="mml-ieqn-644"><mml:mtext>&#x2202;</mml:mtext><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-110">110</xref>) is depicted in Figure <xref ref-type="fig" rid="fig-51">51</xref>, which is a particular case of the more general Figure <xref ref-type="fig" rid="fig-48">48</xref>.</p>
<fig id="fig-51">
<label>Figure 51</label>
<caption><title><italic>Neural network with four layers</italic> in Figure <xref ref-type="fig" rid="fig-50">50</xref> (Section <xref ref-type="sec" rid="s5_3">5.3</xref>). Detailed block diagram. Forward propagation (blue arrows) and backpropagation (red arrows). In the forward propagation wave, at each layer <inline-formula id="ieqn-658"><mml:math id="mml-ieqn-658"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the product <inline-formula id="ieqn-659"><mml:math id="mml-ieqn-659"><mml:mrow><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is computed and stored, awaiting the chain-rule derivative that arrives at this layer to be multiplied by it. The cost function <inline-formula id="ieqn-660"><mml:math id="mml-ieqn-660"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is computed together with its derivative <inline-formula id="ieqn-661"><mml:math id="mml-ieqn-661"><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>y</mml:mi></mml:math></inline-formula>, which is the starting point of backpropagation; following the red backpropagation arrow from there, the factors appear in the same order as in Eq. 
(<xref ref-type="disp-formula" rid="eqn-110">110</xref>), until the derivative <inline-formula id="ieqn-662"><mml:math id="mml-ieqn-662"><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is reached at the head of the red backpropagation arrow. (To save space, only the weights are shown; the biases, which are not needed in the backpropagation, are omitted.) The speed of learning is slowed down significantly in the early layers due to the vanishing gradient, as shown in Figure <xref ref-type="fig" rid="fig-49">49</xref>. See also the more general case in Figure <xref ref-type="fig" rid="fig-48">48</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-51.tif"/>
</fig>
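<p>The forward and backward sweeps of Figure <xref ref-type="fig" rid="fig-51">51</xref> can be sketched for the four-layer scalar chain as follows. This is an illustrative sketch under stated assumptions, not the implementation used in [<xref ref-type="bibr" rid="ref-21">21</xref>]: the active function is taken to be the logistic sigmoid, the cost is differentiated with respect to the predicted output (so the leading factor is the predicted output minus the target), and, since the derivative of <italic>z</italic><sup>(&#x2113;)</sup> with respect to <italic>b</italic><sup>(&#x2113;)</sup> equals 1, the chain for each bias ends at the factor <italic>a</italic>&#x2032;(<italic>z</italic><sup>(&#x2113;)</sup>) of its own layer. All function names and numerical values are hypothetical.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    """Forward pass through the scalar chain; returns all z^(l) and y^(l)."""
    ys, zs = [x], []
    for w_l, b_l in zip(w, b):
        zs.append(w_l * ys[-1] + b_l)   # z^(l) = w^(l) y^(l-1) + b^(l)
        ys.append(sigmoid(zs[-1]))      # y^(l) = a(z^(l))
    return zs, ys


def cost(x, y_target, w, b):
    _, ys = forward(x, w, b)
    return 0.5 * (y_target - ys[-1]) ** 2


def bias_gradients(x, y_target, w, b):
    """Backpropagation; returns [dJ/db^(1), ..., dJ/db^(L)]."""
    zs, ys = forward(x, w, b)
    grads = [0.0] * len(w)
    upstream = ys[-1] - y_target          # derivative of J w.r.t. predicted output
    for l in reversed(range(len(w))):
        s = sigmoid(zs[l])
        dJ_dz = upstream * s * (1.0 - s)  # multiply by a'(z^(l))
        grads[l] = dJ_dz                  # because dz^(l)/db^(l) = 1
        upstream = dJ_dz * w[l]           # derivative passed down to y^(l-1)
    return grads


# Hypothetical parameters and data, chosen only for illustration.
w = [0.8, -0.5, 0.6, 0.9]
b = [0.1, 0.2, -0.1, 0.0]
x, y_target = 0.5, 1.0
grads = bias_gradients(x, y_target, w, b)

# Central finite difference on b^(1) as an independent check of grads[0].
h = 1.0e-6
b_plus = [b[0] + h] + b[1:]
b_minus = [b[0] - h] + b[1:]
fd = (cost(x, y_target, w, b_plus) - cost(x, y_target, w, b_minus)) / (2.0 * h)
```
]]></code>
<p>With these values, the entry for layer (1) is smaller in magnitude than the entry for layer (4), because it carries three additional factors of the form <italic>a</italic>&#x2032;(<italic>z</italic><sup>(&#x2113;)</sup>)<italic>w</italic><sup>(&#x2113;)</sup> and one additional factor <italic>a</italic>&#x2032;(<italic>z</italic><sup>(1)</sup>), each below 1 in magnitude.</p>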
<p>Whether the gradient <inline-formula id="ieqn-768"><mml:math id="mml-ieqn-768"><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-110">110</xref>) vanishes or explodes depends on the magnitude of its factors</p>
<p><disp-formula id="eqn-111"><label>(111)</label><mml:math id="mml-eqn-111" display="block"><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x007C;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x2200;</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x21D2;</mml:mo><mml:mtext>Vanishing&#x2009;gradient</mml:mtext></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-112"><label>(112)</label><mml:math id="mml-eqn-112" display="block"><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x007C;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x003E;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x2200;</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x21D2;</mml:mo><mml:mtext>Exploding&#x2009;gradient</mml:mtext></mml:mrow></mml:math></disp-formula></p>
<p>In the remaining, mixed cases, the problem of vanishing or exploding gradient may be alleviated when the magnitude <inline-formula id="ieqn-769"><mml:math id="mml-ieqn-769"><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:math></inline-formula> varies between values above 1 and values below 1 from layer to layer.</p>
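<p>The three situations (all factors below 1, all factors above 1, and the mixed case) can be illustrated by a running product of factor magnitudes, as in the short sketch below. The depth and the factor values are hypothetical; in an actual network the factors depend on the weights and on the pre-activations of each layer.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
def gradient_factor_product(factors):
    """Running product of the magnitudes |a'(z^(l)) w^(l)|, layer by layer."""
    product, history = 1.0, []
    for f in factors:
        product *= abs(f)
        history.append(product)
    return history


depth = 10  # hypothetical number of layers

# Every factor below 1, as in Eq. (111): the product shrinks geometrically.
vanishing = gradient_factor_product([0.5] * depth)
# Every factor above 1, as in Eq. (112): the product grows geometrically.
exploding = gradient_factor_product([2.0] * depth)
# Mixed case: factors alternating below and above 1 keep the product bounded.
mixed = gradient_factor_product([0.5, 2.0] * (depth // 2))
```
]]></code>
<p>After ten layers, the first product has shrunk by a factor of 1024 and the second has grown by the same factor, whereas the alternating sequence returns to 1.</p>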
<statement id="st5_5"><title><xref ref-type="statement" rid="st5_5">Remark 5.5</xref>.</title>
<p>While the vanishing gradient problem for multilayer networks (static case) may be alleviated by weights that vary from layer to layer (the mixed cases mentioned above), this problem is especially critical in the case of Recurrent Neural Networks, since the weights stay constant for all state numbers (or &#x201C;time&#x201D;) in a sequence of data. See Remark <xref ref-type="statement" rid="st7_3">7.3</xref> on &#x201C;short-term memory&#x201D; in Section <xref ref-type="sec" rid="s7_2">7.2</xref> on Long Short-Term Memory. In backpropagation through the states in a sequence of data, from the last state back to the first state, the same weight keeps being multiplied by itself. Hence, when the magnitude of a weight is less than 1, successive powers of this magnitude decrease toward zero when progressing back to the first state.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
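<p>To give a feeling for how quickly the repeated multiplication in Remark 5.5 shrinks a gradient, the sketch below counts the number of states after which the power of a weight magnitude falls below a tolerance. It is a scalar simplification that ignores the derivative of the active function; the function name, the weights, and the tolerance are hypothetical.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
def steps_until_below(weight, tol):
    """Smallest n such that |weight|**n < tol, for 0 < |weight| < 1 and 0 < tol < 1."""
    magnitude = abs(weight)
    if not 0.0 < magnitude < 1.0 or not 0.0 < tol < 1.0:
        raise ValueError("need 0 < |weight| < 1 and 0 < tol < 1")
    n, factor = 0, 1.0
    while factor >= tol:
        factor *= magnitude   # the same weight is multiplied once per state
        n += 1
    return n


# Hypothetical recurrent weights and tolerance.
n_half = steps_until_below(0.5, 1.0e-3)
n_near_one = steps_until_below(-0.9, 1.0e-3)
```
]]></code>
<p>Even a weight magnitude close to 1 reduces the factor below the tolerance after a finite number of states, so long sequences are affected regardless of how mild the decay per state is.</p>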
<sec id="s5_3_1"><label>5.3.1</label>
<title>Logistic sigmoid and hyperbolic tangent</title>
<p>The first derivatives of the sigmoid function and hyperbolic tangent function depicted in Figure <xref ref-type="fig" rid="fig-30">30</xref> (also in Remark <xref ref-type="statement" rid="st5_3">5.3</xref> on the softmax function and Figure <xref ref-type="fig" rid="fig-46">46</xref>) are given below:</p>
<p><disp-formula id="eqn-113"><label>(113)</label><mml:math id="mml-eqn-113" display="block"><mml:mrow><mml:mi>a</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant='fraktur'>s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>&#x21D2;</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant='fraktur'>s</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant='fraktur'>s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant='fraktur'>s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-114"><label>(114)</label><mml:math id="mml-eqn-114" display="block"><mml:mrow><mml:mi>a</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x21D2;</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>tanh</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>tanh</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>and are less than 1 in magnitude (everywhere for the sigmoid function, and everywhere for the hyperbolic tangent function except at <inline-formula id="ieqn-770"><mml:math id="mml-ieqn-770"><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, where the derivative of the tanh function is equal to 1); see Figure <xref ref-type="fig" rid="fig-52">52</xref>. Successive multiplications of these derivatives will result in smaller and smaller values along the back propagation path. If the weights <inline-formula id="ieqn-771"><mml:math id="mml-ieqn-771"><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-110">110</xref>) are also smaller than 1 in magnitude, then the gradient <inline-formula id="ieqn-772"><mml:math id="mml-ieqn-772"><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> will tend toward 0, i.e., vanish. 
The problem is further exacerbated in deeper networks with an increasing number of layers, and thus an increasing number of factors less than 1 in magnitude (i.e., <inline-formula id="ieqn-773"><mml:math id="mml-ieqn-773"><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x007C;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></inline-formula>). This is the vanishing-gradient problem.</p>
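<p>As a minimal numerical sketch of the argument above (not taken from the cited references), consider a hypothetical chain of one-neuron sigmoid layers with scalar weights, and accumulate the product of the factors a'(z) w over the layers. The function names, the weight value 0.9, the zero biases, and the input 0.5 are assumptions chosen only for illustration.</p>
<code language="python"><![CDATA[
```python
import math


def sigmoid(z):
    """Logistic sigmoid s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative s'(z) = s(z) [1 - s(z)], which never exceeds 1/4."""
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_factor_product(weights, biases, x):
    """Forward pass through a chain of one-neuron sigmoid layers
    (hypothetical toy network), returning the product over all layers
    of the factors a'(z) * w met along the back propagation path."""
    y = x
    product = 1.0
    for w, b in zip(weights, biases):
        z = w * y + b
        product *= sigmoid_prime(z) * w
        y = sigmoid(z)
    return product


# Illustrative values only: weight 0.9, zero bias, input 0.5.
for depth in (2, 5, 10, 20):
    p = backprop_factor_product([0.9] * depth, [0.0] * depth, x=0.5)
    print(depth, p)
```
]]></code>
<p>In this toy setting each factor is at most 0.25 times 0.9 in magnitude, so the printed product shrinks at least geometrically as the depth grows.</p>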
<fig id="fig-52">
<label>Figure 52</label>
<caption><title><italic>Sigmoid and hyperbolic tangent functions, derivative</italic> (Section <xref ref-type="sec" rid="s5_3_1">5.3.1</xref>). The derivative of the sigmoid function (<inline-formula id="ieqn-663"><mml:math id="mml-ieqn-663"><mml:mrow><mml:msup><mml:mrow><mml:mi mathvariant='fraktur'>s</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant='fraktur'>s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant='fraktur'>s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:math></inline-formula>, green line) is less than 1 everywhere, whereas the derivative of the hyperbolic tangent (<inline-formula id="ieqn-664"><mml:math id="mml-ieqn-664"><mml:mrow><mml:msup><mml:mrow><mml:mi>tanh</mml:mi></mml:mrow><mml:mo>&#x02B9;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>tanh</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula>, purple line) is less than 1 everywhere, except at the abscissa <inline-formula id="ieqn-665"><mml:math id="mml-ieqn-665"><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, where it is equal to 1.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-52.tif"/>
</fig>
<p>The exploding gradient problem is the opposite of the vanishing gradient problem, and occurs when the magnitude of the gradient increases through successive multiplications, particularly at a &#x201C;cliff&#x201D;, which is a sharp drop in the cost function in the parameter space.<xref ref-type="fn" rid="fn103"><sup>103</sup></xref><fn id="fn103"><label>103</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 281</p></fn> The steep slope at the brink of a cliff (Figure <xref ref-type="fig" rid="fig-53">53</xref>) yields large-magnitude gradients, which, when multiplied with each other several times along the back propagation path, would result in an exploding gradient problem.</p></sec>
<sec id="s5_3_2"><label>5.3.2</label>
<title>Rectified linear function (ReLU)</title>
<p>The rectified linear function depicted in Figure <xref ref-type="fig" rid="fig-24">24</xref>, with its derivative (Heaviside function) equal to 1 for any input greater than zero, would resolve the vanishing-gradient problem, as written in [<xref ref-type="bibr" rid="ref-113">113</xref>]:</p>
<disp-quote><p>&#x201C;For a given input only a subset of neurons are active. Computation is linear on this subset ... Because of this linearity, gradients flow well on the active paths of neurons (there is no gradient vanishing effect due to activation non-linearities of sigmoid or tanh units), and mathematical investigation is easier. Computations are also cheaper: there is no need for computing the exponential function in activations, and sparsity can be exploited.&#x201D;</p>
</disp-quote>
<fig id="fig-53">
<label>Figure 53</label>
<caption><title><italic>Cost-function cliff</italic> (Section <xref ref-type="sec" rid="s5_3_1">5.3.1</xref>). A cliff, or a sharp drop in the cost function. The parameter space is represented by a weight <inline-formula id="ieqn-666"><mml:math id="mml-ieqn-666"><mml:mi>w</mml:mi></mml:math></inline-formula> and a bias <inline-formula id="ieqn-667"><mml:math id="mml-ieqn-667"><mml:mi>b</mml:mi></mml:math></inline-formula>. The slope at the brink of the cliff leads to large-magnitude gradients, which when multiplied with each other several times along the back propagation path would result in an exploding gradient problem. [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 281. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-53.tif"/>
</fig>
<p>A problem with ReLU was that some neurons were never activated; such neurons were called &#x201C;dying&#x201D; or &#x201C;dead&#x201D;, as described in [<xref ref-type="bibr" rid="ref-131">131</xref>]:</p>
<disp-quote><p>&#x201C;However, ReLU units are at a potential disadvantage during optimization because the gradient is 0 whenever the unit is not active. This could lead to cases where a unit never activates as a gradient-based optimization algorithm will not adjust the weights of a unit that never activates initially. Further, like the vanishing gradients problem, we might expect learning to be slow when training ReL networks with constant 0 gradients.&#x201D;</p>
</disp-quote><p>To remedy this &#x201C;dying&#x201D; or &#x201C;dead&#x201D; neuron problem, the Leaky ReLU, proposed in [<xref ref-type="bibr" rid="ref-131">131</xref>],<xref ref-type="fn" rid="fn104"><sup>104</sup></xref><fn id="fn104"><label>104</label><p>According to Google Scholar, [<xref ref-type="bibr" rid="ref-113">113</xref>] (2011) received 3,656 citations on 2019.10.13 and 8,815 citations on 2022.06.23, whereas [<xref ref-type="bibr" rid="ref-131">131</xref>] (2013) received 2,154 and 6,380 citations on these two respective dates.</p></fn> had the expression already given in Eq. (<xref ref-type="disp-formula" rid="eqn-40">40</xref>), and can be viewed as an approximation to the leaky diode in Figure <xref ref-type="fig" rid="fig-29">29</xref>. Both ReLU and Leaky ReLU had been known and used in neuroscience for years before being imported into artificial neural networks; see Section <xref ref-type="sec" rid="s13">13</xref> for a historical review.</p>
</sec>
<sec id="s5_3_3"><label>5.3.3</label>
<title>Parametric rectified linear unit (PReLU)</title>
<p>Instead of arbitrarily fixing the slope <inline-formula id="ieqn-774"><mml:math id="mml-ieqn-774"><mml:mi>s</mml:mi></mml:math></inline-formula> of the Leaky ReLU at <inline-formula id="ieqn-775"><mml:math id="mml-ieqn-775"><mml:mn>0.01</mml:mn></mml:math></inline-formula> for negative <inline-formula id="ieqn-776"><mml:math id="mml-ieqn-776"><mml:mi>z</mml:mi></mml:math></inline-formula> as in Eq. (<xref ref-type="disp-formula" rid="eqn-40">40</xref>), it is proposed to leave this slope <inline-formula id="ieqn-777"><mml:math id="mml-ieqn-777"><mml:mi>s</mml:mi></mml:math></inline-formula> as a free parameter to optimize along with the weights and biases [<xref ref-type="bibr" rid="ref-61">61</xref>]; see Figure <xref ref-type="fig" rid="fig-54">54</xref>:</p>
<p><disp-formula id="eqn-115"><label>(115)</label><mml:math id="mml-eqn-115" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>s</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>z</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>z</mml:mi></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>z</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>z</mml:mi></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mn>0</mml:mn><mml:mo>&lt;</mml:mo><mml:mi>z</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and thus the network adaptively learned the parameters to control the leaky part of the activation function. Using the Parametric ReLU in Eq. (<xref ref-type="disp-formula" rid="eqn-115">115</xref>), a deep convolutional neural network (CNN) in [<xref ref-type="bibr" rid="ref-61">61</xref>] was able to surpass the level of human performance in image recognition for the first time in 2015; see Figure <xref ref-type="fig" rid="fig-3">3</xref> on ImageNet competition results over the years.</p>
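<p>The piecewise form of Eq. (<xref ref-type="disp-formula" rid="eqn-115">115</xref>) and the idea of treating the slope as a trainable parameter can be sketched as follows. This is an illustrative toy, not the training procedure of [<xref ref-type="bibr" rid="ref-61">61</xref>]: the function names, the squared-error loss, the single training pair, the learning rate, and the initial slope 0.01 are all assumptions made for the example.</p>
<code language="python"><![CDATA[
```python
def prelu(z, s):
    """Parametric ReLU, piecewise form: s*z for z <= 0, z for z > 0."""
    return z if z > 0.0 else s * z


def prelu_dz(z, s):
    """Derivative with respect to the input z (used in back propagation)."""
    return 1.0 if z > 0.0 else s


def prelu_ds(z, s):
    """Derivative with respect to the slope s: only the leaky part
    (z <= 0) contributes to learning s."""
    return 0.0 if z > 0.0 else z


def fit_slope(z, target, s=0.01, lr=0.1, steps=50):
    """Hypothetical one-parameter example: plain gradient descent on the
    loss 0.5 * (prelu(z, s) - target)**2, updating only the slope s."""
    for _ in range(steps):
        residual = prelu(z, s) - target
        s -= lr * residual * prelu_ds(z, s)
    return s


# With z = -2 and target -0.5, the slope that fits exactly is 0.25.
print(fit_slope(z=-2.0, target=-0.5))
```
]]></code>
<p>Starting from the Leaky-ReLU value 0.01, the slope in this toy problem moves toward 0.25, illustrating how the leaky part can be learned rather than fixed in advance.</p>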
<fig id="fig-54">
<label>Figure 54</label>
<caption><title><italic>Rectified Linear Unit</italic> (ReLU, left) and <italic>Parametric ReLU</italic> (right) (Section <xref ref-type="sec" rid="s5_3_2">5.3.2</xref>), in which the slope <inline-formula id="ieqn-668"><mml:math id="mml-ieqn-668"><mml:mi>s</mml:mi></mml:math></inline-formula> is a parameter to optimize; see Section <xref ref-type="sec" rid="s5_3_3">5.3.3</xref>. See also Figure <xref ref-type="fig" rid="fig-24">24</xref> on ReLU.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-54.tif"/>
</fig>
<fig id="fig-55">
<label>Figure 55</label>
<caption><title><italic>Cost-function landscape</italic> (Section <xref ref-type="sec" rid="s6">6</xref>). Residual network with 56 layers (ResNet-56) on the CIFAR-10 training set. Highly non-convex, with many local minima, and deep, narrow valleys [<xref ref-type="bibr" rid="ref-132">132</xref>]. The training error and test error for a fully-connected network increased when the number of layers was increased from 20 to 56, Figure <xref ref-type="fig" rid="fig-43">43</xref>, motivating the introduction of the residual network, Figure <xref ref-type="fig" rid="fig-44">44</xref> and Figure <xref ref-type="fig" rid="fig-45">45</xref>, Section <xref ref-type="sec" rid="s4_6_2">4.6.2</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-55.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec id="s6"><label>6</label>
<title>Network training, optimization methods</title>
<p>For network training, i.e., to find the optimal network parameters <inline-formula id="ieqn-778"><mml:math id="mml-ieqn-778"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> that minimize the cost function <inline-formula id="ieqn-779"><mml:math id="mml-ieqn-779"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, we describe here both deterministic optimization methods used in full-batch mode,<xref ref-type="fn" rid="fn105"><sup>105</sup></xref><fn id="fn105"><label>105</label><p>A &#x201C;full batch&#x201D; is a complete training set of examples; see Footnote <xref ref-type="fn" rid="fn117">117</xref>.</p></fn> and stochastic optimization methods used in minibatch<xref ref-type="fn" rid="fn106"><sup>106</sup></xref><fn id="fn106"><label>106</label><p>A minibatch is a random subset of the training set, which is called here the &#x201C;full batch&#x201D;; see Footnote <xref ref-type="fn" rid="fn117">117</xref>.</p></fn> mode.</p>
<p>Figure <xref ref-type="fig" rid="fig-55">55</xref> shows the highly non-convex landscape of the cost function of a residual network with 56 layers trained using the CIFAR-10 dataset (Canadian Institute For Advanced Research), a collection of images commonly used to train machine learning and computer vision algorithms, containing 60,000 32x32 color images in 10 different classes, representing airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Each class has 6,000 images.<xref ref-type="fn" rid="fn107"><sup>107</sup></xref><fn id="fn107"><label>107</label><p>See &#x201C;CIFAR-10&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=CIFAR-10&amp;oldid=921224113">version 16:44, 14 October 2019</ext-link>.</p></fn></p>
<p>Deterministic optimization methods (Section <xref ref-type="sec" rid="s6_2">6.2</xref>) include first-order gradient method (Algorithm <xref ref-type="fig" rid="fig-160">2</xref>) and second-order quasi-Newton method (Algorithm <xref ref-type="fig" rid="fig-161">3</xref>), with line searches based on different rules, introduced by Goldstein, Armijo, and Wolfe.</p>
<p>Stochastic optimization methods (Section <xref ref-type="sec" rid="s6_3">6.3</xref>) include
<list list-type="bullet">
<list-item><p>First-order <xref ref-type="sec" rid="s6_3_1">stochastic gradient descent</xref> (<xref ref-type="sec" rid="s6_3_1">SGD</xref>) methods (Algorithm <xref ref-type="fig" rid="fig-162">4</xref>), with <xref ref-type="sec" rid="s6_3_1">add-on tricks</xref> such as momentum and accelerated gradient</p></list-item>
<list-item><p><xref ref-type="sec" rid="s6_5">Adaptive learning-rate algorithms</xref> (Algorithm <xref ref-type="fig" rid="fig-163">5</xref>): <xref ref-type="sec" rid="s6_5_6">Adam</xref> and variants such as <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref>, <xref ref-type="sec" rid="s6_5_10">AdamW</xref>, etc. that are popular in the machine-learning community</p></list-item>
<list-item><p><xref ref-type="sec" rid="s6_5_9">Criticism of adaptive methods</xref> and SGD resurgence with <xref ref-type="sec" rid="s6_3_1">add-on tricks</xref> such as effective tuning and step-length decay (or annealing)</p></list-item>
<list-item><p>Classical line search with stochasticity: <xref ref-type="sec" rid="s6_6">SGD with Armijo line search</xref> (Algorithm <xref ref-type="fig" rid="fig-164">6</xref>), second-order <xref ref-type="sec" rid="s6_7">Newton method with Armijo-like line search</xref> (Algorithm 7)</p></list-item></list></p>
<fig id="fig-56">
<label>Figure 56</label>
<caption><title><italic>Training set, validation set, test set</italic> (Section <xref ref-type="sec" rid="s6_1">6.1</xref>). Partition of whole dataset. The examples are independent. The three subsets are identically distributed.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-56.tif"/>
</fig>
<sec id="s6_1"><label>6.1</label>
<title>Training set, validation set, test set, stopping criteria</title>
<p>The classical (old) thinking, starting in 1992 with [<xref ref-type="bibr" rid="ref-133">133</xref>] and exemplified by Figures <xref ref-type="fig" rid="fig-57">57</xref>, <xref ref-type="fig" rid="fig-58">58</xref>, <xref ref-type="fig" rid="fig-59">59</xref>, <xref ref-type="fig" rid="fig-60">60</xref> (a, left), holds that <italic>minimizing the training error is not optimal</italic> in machine learning, which may surprise first-time learners. A reason is that training a neural network is <italic>different</italic> from using &#x201C;pure optimization&#x201D; since it is desired to decrease not only the error during training (called training error, and that&#x2019;s pure optimization), but also the error committed by a trained network on inputs never seen before.<xref ref-type="fn" rid="fn108"><sup>108</sup></xref><fn id="fn108"><label>108</label><p>See also [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 268, Section 8.1, &#x201C;How learning differs from pure optimization&#x201D;; that&#x2019;s the classical thinking.</p></fn> Such error is called generalization error or test error. This classical thinking, known as the <italic>bias-variance trade-off</italic>, has been included in books since 2001 [<xref ref-type="bibr" rid="ref-134">134</xref>] (p. 194) and even repeated in 2016 [<xref ref-type="bibr" rid="ref-78">78</xref>] (p. 268). Models with fewer parameters have higher bias and lower variance, whereas models with more parameters have lower bias and higher variance; Figure <xref ref-type="fig" rid="fig-59">59</xref>.<xref ref-type="fn" rid="fn109"><sup>109</sup></xref><fn id="fn109"><label>109</label><p>See [<xref ref-type="bibr" rid="ref-134">134</xref>], p. 11, for a classification example using two methods: (1) linear models and least squares and (2) k-nearest neighbors. 
&#x201C;The linear model makes huge assumptions about structure [high bias] and yields stable [low variance] but possibly inaccurate predictions [high training error]. The method of k-nearest neighbors makes very mild structural assumptions [low bias]: its predictions are often accurate [low training error] but can be unstable [high variance].&#x201D; See also [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 151, Figure 3.6.</p></fn></p>
<p>The modern thinking is exemplified by Figure <xref ref-type="fig" rid="fig-60">60</xref> (b, right) and Figure <xref ref-type="fig" rid="fig-61">61</xref>, and does not contradict the intuitive notion that <italic>decreasing the training error to zero is indeed desirable</italic>, as overparameterizing networks beyond the interpolation threshold (zero training error) in modern practice generalizes well (small test error). In Figure <xref ref-type="fig" rid="fig-61">61</xref>, the test error continued to decrease significantly with increasing number of parameters <inline-formula id="ieqn-780"><mml:math id="mml-ieqn-780"><mml:mi>N</mml:mi></mml:math></inline-formula> beyond the interpolation threshold <inline-formula id="ieqn-781"><mml:math id="mml-ieqn-781"><mml:msup><mml:mi>N</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mn>825</mml:mn></mml:math></inline-formula>, whereas the classical regime (<inline-formula id="ieqn-782"><mml:math id="mml-ieqn-782"><mml:mi>N</mml:mi><mml:mo>&#x003C;</mml:mo><mml:msup><mml:mi>N</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula>) with the bias-variance trade-off (blue) in Figure <xref ref-type="fig" rid="fig-59">59</xref> was restrictive, and did not generalize as well (larger test error). Beyond the interpolation threshold <inline-formula id="ieqn-783"><mml:math id="mml-ieqn-783"><mml:msup><mml:mi>N</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula>, variance can be decreased by using ensemble average, as shown by the orange line in Figure <xref ref-type="fig" rid="fig-61">61</xref>.</p>
<p>Such modern practice was the motivation for research into shallow networks with <italic>infinite</italic> width as a first step to understand how overparameterized networks worked so well; see Figure <xref ref-type="fig" rid="fig-148">148</xref> and Section <xref ref-type="sec" rid="s14_2">14.2</xref> &#x201C;Lack of understanding on why deep learning worked.&#x201D;</p>
<fig id="fig-57">
<label>Figure 57</label>
<caption><title><italic>Training and validation learning curves&#x2014;Classical viewpoint</italic> (Section <xref ref-type="sec" rid="s6_1">6.1</xref>), i.e., plots of training and validation errors versus epoch number (time). While the training cost decreased continuously, the validation cost reached a minimum around epoch 20, then started to gradually increase, forming an &#x201C;asymmetric U-shaped curve.&#x201D; Between epoch 100 and epoch 240, the training error was essentially flat, indicating convergence. Adapted from [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 239. See Figure <xref ref-type="fig" rid="fig-60">60</xref> (a, left), where the classical risk curve is the classical viewpoint, whereas the modern interpolation viewpoint is on the right subfigure (b). (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-57.tif"/>
</fig>
<p>To develop a neural-network model, a dataset governed by the same probability distribution, such as the CIFAR-10 dataset mentioned above, is typically divided into three non-overlapping subsets called <italic>training set, validation set</italic>, and <italic>test set</italic>. The validation set is also called the <italic>development set</italic>, a terminology used in [<xref ref-type="bibr" rid="ref-55">55</xref>], in which an effective method of step-length decay was proposed; see Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref>.</p>
<p>It was suggested in [<xref ref-type="bibr" rid="ref-135">135</xref>], p. 61, to use 50% of the dataset as training set, 25% as validation set, and 25% as test set. On the other hand, while a validation set with size about <inline-formula id="ieqn-784"><mml:math id="mml-ieqn-784"><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>4</mml:mn></mml:math></inline-formula> of the training set was suggested in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 118, there was no suggestion for the relative size of the test set.<xref ref-type="fn" rid="fn110"><sup>110</sup></xref><fn id="fn110"><label>110</label><p>Andrew Ng suggested the following partitions. For small datasets having less than <inline-formula id="ieqn-3118a"><mml:math id="mml-ieqn-3118a"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> examples, the training/validation/test ratio of <inline-formula id="ieqn-3119a"><mml:math id="mml-ieqn-3119a"><mml:mn>60</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>20</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>20</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> could be used. For large datasets with order of <inline-formula id="ieqn-3120a"><mml:math id="mml-ieqn-3120a"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>6</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> examples, use ratio <inline-formula id="ieqn-3121"><mml:math id="mml-ieqn-3121"><mml:mn>98</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula>. 
For datasets with much more than <inline-formula id="ieqn-3122"><mml:math id="mml-ieqn-3122"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>6</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> examples, use ratio <inline-formula id="ieqn-3123"><mml:math id="mml-ieqn-3123"><mml:mn>99.5</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>0.25</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>0.25</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula>. See Coursera course &#x201C;Improving deep neural network: Hyperparameter tuning, regularization and optimization&#x201D;, at time 4:00, <ext-link ext-link-type="uri" xlink:href="https://www.coursera.org/lecture/deep-neural-network/train-dev-test-sets-cxG1s">video website</ext-link>.</p></fn> See Figure <xref ref-type="fig" rid="fig-56">56</xref> for a conceptual partition of the dataset.</p>
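<p>A partition into three non-overlapping subsets can be sketched in a few lines. The helper below is a hypothetical illustration, not code from the cited references: it shuffles the examples once with a fixed seed and cuts the shuffled list according to the 50%/25%/25% proportions suggested in [<xref ref-type="bibr" rid="ref-135">135</xref>]; the function name, the default fractions, and the seed are assumptions.</p>
<code language="python"><![CDATA[
```python
import random


def split_dataset(examples, train_frac=0.5, val_frac=0.25, seed=0):
    """Shuffle the examples once, then cut the shuffled list into
    non-overlapping training, validation, and test subsets.
    The test set receives whatever remains after the first two cuts."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # same seed, same partition
    n_train = int(train_frac * len(items))
    n_val = int(val_frac * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Illustrative use on 1000 integer "examples".
train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```
]]></code>
<p>Shuffling before cutting is one simple way to make the three subsets resemble one another statistically, in line with the assumption that all examples follow the same probability distribution. Other ratios, such as those quoted in the footnote above for large datasets, only require changing the two fraction arguments.</p>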
<p>Examples in the training set are fed into an optimizer to find the network parameter estimate <inline-formula id="ieqn-785"><mml:math id="mml-ieqn-785"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> that minimizes the cost function estimate <inline-formula id="ieqn-786"><mml:math id="mml-ieqn-786"><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.<xref ref-type="fn" rid="fn111"><sup>111</sup></xref><fn id="fn111"><label>111</label><p>The word &#x201C;estimate&#x201D; is used here for the more general case of stochastic optimization with minibatches; see Section <xref ref-type="sec" rid="s6_3_1">6.3.1</xref> on stochastic gradient descent and subsequent sections on stochastic algorithms. 
When deterministic optimization is used with the full batch of dataset, then the cost estimate is the same as the cost, i.e., <inline-formula id="ieqn-3124"><mml:math id="mml-ieqn-3124"><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2261;</mml:mo><mml:mi>J</mml:mi></mml:math></inline-formula>, and the network parameter estimates are the same as the network parameters, i.e., <inline-formula id="ieqn-3125"><mml:math id="mml-ieqn-3125"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2261;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula>.</p></fn> As the optimization on the training set progresses from epoch to epoch,<xref ref-type="fn" rid="fn112"><sup>112</sup></xref><fn id="fn112"><label>112</label><p>An epoch is when all examples in the dataset had been used in a training session of the optimization process. 
For a formal definition of &#x201C;epoch&#x201D;, see Section <xref ref-type="sec" rid="s6_3_1">6.3.1</xref> on stochastic gradient descent (SGD) and Footnote <xref ref-type="fn" rid="fn145">145</xref>.</p></fn> examples in the validation set are fed as inputs into the network to obtain the outputs for computing the cost function <inline-formula id="ieqn-787"><mml:math id="mml-ieqn-787"><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>&#x03C4;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, also called validation error, at predetermined epochs <inline-formula id="ieqn-788"><mml:math id="mml-ieqn-788"><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>&#x03BA;</mml:mi></mml:msub></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> using the network parameters <inline-formula id="ieqn-789"><mml:math id="mml-ieqn-789"><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>&#x03BA;</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> obtained from the optimization on the training set at those epochs.</p>
<fig id="fig-58">
<label>Figure 58</label>
<caption><title><italic>Validation learning curve</italic> (Section <xref ref-type="sec" rid="s6_1">6.1</xref>, Algorithm <xref ref-type="fig" rid="fig-162">4</xref>). Validation error vs epoch number. Some validation error could oscillate wildly around the mean, resulting in an &#x201C;ugly reality&#x201D;. The global minimum validation error corresponded to epoch number <inline-formula id="ieqn-669"><mml:math id="mml-ieqn-669"><mml:msup><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula>. Since the stopping criteria may miss this global minimum, it was suggested to monitor the validation learning curve to find the epoch <inline-formula id="ieqn-670"><mml:math id="mml-ieqn-670"><mml:msup><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula> at which the network parameters <inline-formula id="ieqn-671"><mml:math id="mml-ieqn-671"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:mrow></mml:msub></mml:math></inline-formula> would be declared optimal. Adapted from [<xref ref-type="bibr" rid="ref-135">135</xref>], p. 55. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-58.tif"/>
</fig>
<p>Figure <xref ref-type="fig" rid="fig-57">57</xref> shows the different behaviour of the training error versus that of the validation error. The validation error decreased quickly at first, reached a global minimum, then gradually increased, whereas the training error continued to decrease and then plateaued, indicating that the gradients got smaller and smaller, and there was not much decrease in the cost. From epoch 100 to epoch 240, the training error was at about the same level, with little noise. The validation error, on the other hand, had a lot of noise.</p>
<p>Because of the &#x201C;asymmetric U-shaped curve&#x201D; of the validation error, the thinking was that if the optimization process could stop early at the global minimum of the validation error, then the generalization (test) error, i.e., the value of the cost function on the test set, would also be small, hence the name &#x201C;<italic>early stopping</italic>&#x201D;. The test set contains examples that have not been used to train the network, thus simulating inputs never seen before. The validation error could have oscillations with large amplitude around a mean curve, with many local minima; see Figure <xref ref-type="fig" rid="fig-58">58</xref>.</p>
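<p>The idea of monitoring the validation learning curve and retaining the parameters at its global minimum can be sketched as below. This is only an illustration under stated assumptions: the record layout (epoch, validation error, parameter snapshot), the two function names, the naive stopping rule, and the numbers in the sample history are all hypothetical and are not results from any reference.</p>
<code language="python"><![CDATA[
```python
def select_checkpoint(history):
    """Return the (epoch, val_error, params) record with the lowest
    validation error among all recorded epochs."""
    return min(history, key=lambda record: record[1])


def naive_stop_epoch(history):
    """Hypothetical naive rule for comparison: stop as soon as the
    validation error rises from one record to the next, and return the
    epoch just before the rise."""
    for prev, curr in zip(history, history[1:]):
        if curr[1] > prev[1]:
            return prev[0]
    return history[-1][0]


# Made-up oscillating validation curve; the strings stand in for
# stored network-parameter snapshots.
history = [
    (10, 0.90, "theta_10"),
    (20, 0.42, "theta_20"),
    (30, 0.55, "theta_30"),
    (40, 0.38, "theta_40"),
    (50, 0.47, "theta_50"),
    (60, 0.61, "theta_60"),
]

best_epoch, best_error, best_params = select_checkpoint(history)
print(best_epoch, best_error, best_params)
print(naive_stop_epoch(history))
```
]]></code>
<p>In this made-up curve the naive rule halts at the first local minimum, whereas scanning the whole recorded curve picks the later, lower minimum. This is why an oscillating validation error makes the choice of stopping criterion delicate.</p>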
<p>The difference between the test (generalization) error and the training error is called the generalization gap, as shown in the <italic>bias-variance trade-off</italic> [<xref ref-type="bibr" rid="ref-133">133</xref>] in Figure <xref ref-type="fig" rid="fig-59">59</xref>, which qualitatively delineates these errors versus model capacity, and conceptually explains the optimal model capacity as the capacity at which the generalization gap equals the training error, i.e., at which the generalization error is twice the training error.</p>
<statement id="st6_1"><title>Remark 6.1.</title>
<p>Even the best machine learning generalization capability nowadays still cannot compete with the generalization ability of human babies; see Section <xref ref-type="sec" rid="s14_6">14.6</xref> on &#x201C;What&#x2019;s new? Teaching machines to think like babies&#x201D;.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p><bold>Early-stopping criteria.</bold> One criterion is to first define the lowest validation error from epoch 1 up to the current epoch <inline-formula id="ieqn-790"><mml:math id="mml-ieqn-790"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> as:</p>
<fig id="fig-59">
<label>Figure 59</label>
<caption><title><italic>Bias-variance trade-off</italic> (Section <xref ref-type="sec" rid="s6_1">6.1</xref>). Training error (cost) and test error versus model capacity. Two ways to change the model capacity: (1) change the number of network parameters, (2) change the values of these parameters (weight decay). The generalization gap is the difference between the test (generalization) error and the training error. As the model capacity increases from underfit to overfit, the training error decreases, but the generalization gap increases, past the optimal capacity. Figure <xref ref-type="fig" rid="fig-72">72</xref> gives examples of underfit, appropriately fit, overfit. See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 112. The above is the classical viewpoint, which is still prevalent [<xref ref-type="bibr" rid="ref-136">136</xref>]; see Figure <xref ref-type="fig" rid="fig-60">60</xref> for the modern viewpoint, in which overfitting with high capacity model generalizes well (small test error) in practice. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-59.tif"/>
</fig>
<p><disp-formula id="eqn-116"><label>(116)</label><mml:math id="mml-eqn-116" display="block"><mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>&#x03C4;</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mi>min</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mi>&#x03C4;</mml:mi></mml:mrow></mml:munder><mml:mo>&#x007B;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2032;</mml:mo></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x007D;</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>then define the <italic>generalization loss</italic> (in percentage) at epoch <inline-formula id="ieqn-791"><mml:math id="mml-ieqn-791"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> as the increase in validation error relative to the minimum validation error from epoch 1 to the present epoch <inline-formula id="ieqn-792"><mml:math id="mml-ieqn-792"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-117"><label>(117)</label><mml:math id="mml-eqn-117" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>G</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>100</mml:mn><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03C4;</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03C4;</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>[<xref ref-type="bibr" rid="ref-135">135</xref>] then defined the &#x201C;first class of stopping criteria&#x201D; as follows: Stop the optimization on the training set when the generalization loss exceeds a certain threshold <inline-formula id="ieqn-793"><mml:math id="mml-ieqn-793"><mml:mrow><mml:mtext>&#x1D5CC;</mml:mtext></mml:mrow></mml:math></inline-formula> (generalization loss lower bound):</p>
<p><disp-formula id="eqn-118"><label>(118)</label><mml:math id="mml-eqn-118" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mtext>&#x1D5CC;</mml:mtext></mml:mrow></mml:msub><mml:mo>:</mml:mo><mml:mrow><mml:mtext>Stop after epoch&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mrow><mml:mtext>&#x00A0;if&#x00A0;</mml:mtext></mml:mrow><mml:mi>G</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&gt;</mml:mo><mml:mrow><mml:mtext>&#x1D5CC;</mml:mtext></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The issue is how to determine the generalization loss lower bound <inline-formula id="ieqn-794"><mml:math id="mml-ieqn-794"><mml:mrow><mml:mtext>&#x1D5CC;</mml:mtext></mml:mrow></mml:math></inline-formula> so as not to fall into a local minimum, and to catch the global minimum; see Figure <xref ref-type="fig" rid="fig-58">58</xref>. There were many more classes of early-stopping criteria in [<xref ref-type="bibr" rid="ref-135">135</xref>]. But it is not clear whether all these increasingly sophisticated stopping criteria would work to catch the validation-error global minimum in Figure <xref ref-type="fig" rid="fig-58">58</xref>.</p>
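<p>As a minimal sketch of the criterion in Eq. (<xref ref-type="disp-formula" rid="eqn-118">118</xref>), and not the implementation used in [<xref ref-type="bibr" rid="ref-135">135</xref>], the generalization loss of Eq. (<xref ref-type="disp-formula" rid="eqn-117">117</xref>) and the first stopping epoch can be computed from a recorded validation curve. The function names and the numerical values of the validation errors below are hypothetical, chosen only for illustration, and the validation errors are assumed to be strictly positive.</p>
<code language="python"><![CDATA[
```python
def generalization_loss(val_errors):
    """Return G(tau) in percent for every epoch, following Eqs. (116)-(117)."""
    losses = []
    best = float("inf")
    for err in val_errors:
        best = min(best, err)                      # Eq. (116): lowest error so far
        losses.append(100.0 * (err / best - 1.0))  # Eq. (117)
    return losses


def first_stop_epoch(val_errors, threshold):
    """Return the 1-based epoch at which G(tau) first exceeds the threshold.

    Returns None if the criterion of Eq. (118) is never triggered.
    """
    for epoch, g in enumerate(generalization_loss(val_errors), start=1):
        if g > threshold:                          # Eq. (118)
            return epoch
    return None


# Hypothetical noisy validation curve; its global minimum is at epoch 5.
val_errors = [1.00, 0.80, 0.84, 0.70, 0.60, 0.63, 0.75]

stop_small = first_stop_epoch(val_errors, threshold=4.0)
stop_large = first_stop_epoch(val_errors, threshold=10.0)
print(stop_small, stop_large)
```
]]></code>
<p>With this illustrative curve, a threshold of 4 percent stops the training after epoch 3, i.e., at a local minimum before the global minimum at epoch 5, whereas a threshold of 10 percent stops after epoch 7, once the global minimum has been passed. The sketch thus shows why the choice of the threshold matters on an oscillating validation curve.</p>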
<p>Moreover, the above discussion is for the <italic>classical regime</italic> in Figure <xref ref-type="fig" rid="fig-60">60</xref> (a). In the context of the <italic>modern interpolation regime</italic> in Figure <xref ref-type="fig" rid="fig-60">60</xref> (b), early stopping means that the computation would cease as soon as the training error reaches &#x201C;its lowest possible value (typically zero [beyond the interpolation threshold], unless two identical data points have two different labels)&#x201D; [<xref ref-type="bibr" rid="ref-137">137</xref>]. See the green line in Figure <xref ref-type="fig" rid="fig-61">61</xref>.</p>
<p><bold>Computational budget, learning curves.</bold> A simple method would be to set an epoch budget, i.e., a maximum number of epochs that is sufficiently large for the training error to go down significantly, then monitor graphically both the training error (cost) and the validation error versus epoch number. These plots are called the <italic>learning curves</italic>; see Figure <xref ref-type="fig" rid="fig-57">57</xref>, for which an epoch budget of 240 was used. Select the global minimum of the validation learning curve, with epoch number <inline-formula id="ieqn-795"><mml:math id="mml-ieqn-795"><mml:msup><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula> (Figure <xref ref-type="fig" rid="fig-57">57</xref>), and use the corresponding network parameters <inline-formula id="ieqn-796"><mml:math id="mml-ieqn-796"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:mrow></mml:msub></mml:math></inline-formula>, which were saved periodically, as optimal parameters for the network.<xref ref-type="fn" rid="fn113"><sup>113</sup></xref><fn id="fn113"><label>113</label><p>See also &#x201C;Method for early stopping in a neural network&#x201D;, StackExchange, 2018.03.05, <ext-link ext-link-type="uri" xlink:href="https://stats.stackexchange.com/questions/331821/method-for-early-stopping-in-a-neural-network">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20200109015935/https://stats.stackexchange.com/questions/331821/method-for-early-stopping-in-a-neural-network">Internet archive</ext-link>. [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 287, also suggested monitoring the training learning curve to adjust the step length (learning rate).</p></fn></p>
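<p>The epoch-budget procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the callables <monospace>train_one_epoch</monospace> and <monospace>validation_error</monospace> are hypothetical placeholders for an actual training step and validation pass, the toy &#x201C;network&#x201D; is a single number, and a copy of the parameters is kept whenever the validation error improves, which is one possible variant of saving the parameters periodically.</p>
<code language="python"><![CDATA[
```python
import copy


def train_with_epoch_budget(params, train_one_epoch, validation_error, tau_max):
    """Run tau_max epochs and keep the parameters with the lowest validation error.

    Returns (best_params, best_epoch, history), where history is the
    validation learning curve and best_epoch is 1-based.
    """
    best_error = float("inf")
    best_epoch = None
    best_params = None
    history = []
    for tau in range(1, tau_max + 1):
        params = train_one_epoch(params)
        err = validation_error(params)
        history.append(err)
        if err < best_error:                  # new global minimum so far
            best_error, best_epoch = err, tau
            best_params = copy.deepcopy(params)
    return best_params, best_epoch, history


# Toy stand-ins (hypothetical): the parameter is one float, each epoch adds 1,
# and the validation error has its global minimum at the value 3.
def train_one_epoch(p):
    return p + 1.0


def validation_error(p):
    return (p - 3.0) ** 2 + 1.0


best_params, best_epoch, history = train_with_epoch_budget(
    0.0, train_one_epoch, validation_error, tau_max=6
)
print(best_epoch, best_params, history)
```
]]></code>
<p>In this toy run, the validation learning curve is U-shaped with its global minimum at epoch 3, and the parameters stored at that epoch are returned even though training continued up to the epoch budget.</p>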
<fig id="fig-60">
<label>Figure 60</label>
<caption><title><italic>Modern interpolation regime</italic> (Sections <xref ref-type="sec" rid="s6_1">6.1</xref>, <xref ref-type="sec" rid="s14_2">14.2</xref>). Beyond the interpolation threshold, the test error goes down as the model capacity (e.g., number of parameters) increases, describing the observation that networks with high capacity beyond the interpolation threshold generalize well, even though they overfit in training. Risk = error or cost. Capacity = number of parameters (but could also be increased by weight decay). <xref ref-type="fig" rid="fig-57">Figures 57</xref> and <xref ref-type="fig" rid="fig-59">59</xref> correspond to the classical regime, i.e., old method (thinking) [<xref ref-type="bibr" rid="ref-136">136</xref>]. See Figure <xref ref-type="fig" rid="fig-61">61</xref> for experimental evidence of the modern interpolation regime, and Figure <xref ref-type="fig" rid="fig-148">148</xref> for a shallow network with infinite width. <ext-link ext-link-type="uri" xlink:href="https://www.pnas.org/page/about/rights-permissions">Permission of NAS</ext-link>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-60.tif"/>
</fig>
<statement id="st6_2"><title>Remark 6.2.</title>
<p>Since it is important to monitor the validation error during training, a whole section is devoted in [<xref ref-type="bibr" rid="ref-78">78</xref>] (Section <xref ref-type="sec" rid="s8_1">8.1</xref>, p. 268) to expound on &#x201C;How Learning Differs from Pure Optimization&#x201D;. And also for this reason, it is not clear yet what <italic>global optimization</italic> algorithms such as in [<xref ref-type="bibr" rid="ref-138">138</xref>] could bring to network training, whereas the stochastic gradient descent (SGD) in Section <xref ref-type="sec" rid="s6_3_1">6.3.1</xref> is quite efficient; see also Section <xref ref-type="sec" rid="s6_5_9">6.5.9</xref> on criticism of adaptive methods.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement> 
<statement id="st6_3"><title>Remark 6.3.</title>
<p><italic>Epoch budget, global iteration budget</italic>. For stochastic optimization algorithms&#x2014;Sections <xref ref-type="sec" rid="s6_3">6.3</xref>, <xref ref-type="sec" rid="s6_5">6.5</xref>, <xref ref-type="sec" rid="s6_6">6.6</xref>, <xref ref-type="sec" rid="s6_7">6.7</xref>&#x2014;the epoch counter is <inline-formula id="ieqn-797"><mml:math id="mml-ieqn-797"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> and the epoch budget is <inline-formula id="ieqn-798"><mml:math id="mml-ieqn-798"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Numerical experiments in Figure <xref ref-type="fig" rid="fig-73">73</xref> had an epoch budget of <inline-formula id="ieqn-799"><mml:math id="mml-ieqn-799"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>250</mml:mn></mml:math></inline-formula>, whereas numerical experiments in Figure <xref ref-type="fig" rid="fig-74">74</xref> had an epoch budget of <inline-formula id="ieqn-800"><mml:math id="mml-ieqn-800"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1800</mml:mn></mml:math></inline-formula>. The computational budget could also be specified in terms of the global iteration counter <inline-formula id="ieqn-801"><mml:math id="mml-ieqn-801"><mml:mi>j</mml:mi></mml:math></inline-formula> as <inline-formula id="ieqn-802"><mml:math id="mml-ieqn-802"><mml:msub><mml:mi>j</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
Figure <xref ref-type="fig" rid="fig-71">71</xref> had a global iteration budget of <inline-formula id="ieqn-803"><mml:math id="mml-ieqn-803"><mml:msub><mml:mi>j</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>5000</mml:mn></mml:math></inline-formula>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>Before presenting the stochastic gradient-descent (SGD) methods in Section <xref ref-type="sec" rid="s6_3">6.3</xref>, it is important to note that classical deterministic methods of optimization in Section <xref ref-type="sec" rid="s6_2">6.2</xref> continue to be useful in the age of deep learning and SGD.</p>
<disp-quote><p>&#x201C;One should not lose sight of the fact that [full] batch approaches possess some intrinsic advantages. First, the use of full gradient information at each iterate opens the door for many deterministic gradient-based optimization methods that have been developed over the past decades, including not only the full gradient method, but also accelerated gradient, conjugate gradient, quasi-Newton, inexact Newton methods, and can benefit from parallelization.&#x201D; [<xref ref-type="bibr" rid="ref-80">80</xref>], p. 237.</p>
</disp-quote></sec>
<sec id="s6_2"><label>6.2</label>
<title>Deterministic optimization, full batch</title>
<p>Once the gradient <inline-formula id="ieqn-804"><mml:math id="mml-ieqn-804"><mml:mtext>&#x2202;</mml:mtext><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of the cost function <inline-formula id="ieqn-805"><mml:math id="mml-ieqn-805"><mml:mi>J</mml:mi></mml:math></inline-formula> has been computed using backpropagation described in Section <xref ref-type="sec" rid="s5_2">5.2</xref>, the layer parameters <inline-formula id="ieqn-806"><mml:math id="mml-ieqn-806"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are updated to decrease cost function <inline-formula id="ieqn-807"><mml:math id="mml-ieqn-807"><mml:mi>J</mml:mi></mml:math></inline-formula> using gradient descent as follows:</p>
<fig id="fig-61">
<label>Figure 61</label>
<caption><title><italic>Empirical test error vs Number of parameters</italic> (Sections <xref ref-type="sec" rid="s6_1">6.1</xref>, <xref ref-type="sec" rid="s14_2">14.2</xref>). Experiments using the <ext-link ext-link-type="uri" xlink:href="http://yann.lecun.com/exdb/mnist/">MNIST handwritten digit database</ext-link> in [<xref ref-type="bibr" rid="ref-137">137</xref>] confirmed the modern interpolation regime in Figure <xref ref-type="fig" rid="fig-60">60</xref> [<xref ref-type="bibr" rid="ref-136">136</xref>]. <italic>Blue:</italic> Average over 20 runs. <italic>Green:</italic> Early stopping. <italic>Orange:</italic> Ensemble average on <inline-formula id="ieqn-672"><mml:math id="mml-ieqn-672"><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>20</mml:mn></mml:math></inline-formula> samples, trained independently. See Figure <xref ref-type="fig" rid="fig-148">148</xref> for a shallow network with infinite width. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-61.tif"/>
</fig>
<p><disp-formula id="eqn-119"><label>(119)</label><mml:math id="mml-eqn-119" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">&#x2190;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>:=</mml:mo><mml:mi 
mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>being the gradient direction, and <inline-formula id="ieqn-808"><mml:math id="mml-ieqn-808"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> called the learning rate.<xref ref-type="fn" rid="fn114"><sup>114</sup></xref><fn id="fn114"><label>114</label><p>See Figure <xref ref-type="fig" rid="fig-151">151</xref> in Section <xref ref-type="sec" rid="s14_7">14.7</xref> on &#x201C;Lack of transparency and irreproducibility of results&#x201D; in recent deep-learning papers.</p></fn> The layer-by-layer update in Eq. (<xref ref-type="disp-formula" rid="eqn-119">119</xref>) as soon as the gradient <inline-formula id="ieqn-809"><mml:math id="mml-ieqn-809"><mml:msup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> had been computed is valid when the learning rate <inline-formula id="ieqn-810"><mml:math id="mml-ieqn-810"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> does not depend on the gradient <inline-formula id="ieqn-811"><mml:math id="mml-ieqn-811"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula> with respect to the whole set of network parameters <inline-formula id="ieqn-812"><mml:math id="mml-ieqn-812"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula>.</p>
<p>Otherwise, the update of the whole network parameter <inline-formula id="ieqn-813"><mml:math id="mml-ieqn-813"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> would be carried out after the complete gradient <inline-formula id="ieqn-814"><mml:math id="mml-ieqn-814"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula> had been obtained, and the learning rate <inline-formula id="ieqn-815"><mml:math id="mml-ieqn-815"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> had been computed based on the gradient <inline-formula id="ieqn-816"><mml:math id="mml-ieqn-816"><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-120"><label>(120)</label><mml:math id="mml-eqn-120" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">&#x2190;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>:=</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-817"><mml:math id="mml-ieqn-817"><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:math></inline-formula> is the total number of network parameters defined in Eq. (<xref ref-type="disp-formula" rid="eqn-34">34</xref>). An example of a learning-rate computation that depends on the complete gradient <inline-formula id="ieqn-818"><mml:math id="mml-ieqn-818"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula> is gradient descent with Armijo line search; see Section <xref ref-type="sec" rid="s6_2_3">6.2.3</xref> and line <xref ref-type="fig" rid="fig-159">8</xref> in Algorithm <xref ref-type="fig" rid="fig-159">1</xref>.</p>
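<p>To illustrate why the update in Eq. (<xref ref-type="disp-formula" rid="eqn-120">120</xref>) has to wait for the complete gradient when the learning rate depends on it, the following is a minimal sketch of gradient descent with a backtracking step length that satisfies the Armijo sufficient-decrease condition. It is a generic textbook form and not necessarily identical to Algorithm <xref ref-type="fig" rid="fig-159">1</xref>; the function name, the initial step length <monospace>eps0</monospace>, the shrink factor, the constant <monospace>c</monospace>, and the quadratic test cost are all assumptions made for illustration.</p>
<code language="python"><![CDATA[
```python
def armijo_gradient_descent(cost, grad, theta, eps0=1.0, shrink=0.5,
                            c=1e-4, max_iter=50, tol=1e-12):
    """Gradient descent, Eq. (120), with a backtracking Armijo step length.

    Returns the final parameters and the list of accepted step lengths.
    """
    steps = []
    for _ in range(max_iter):
        g = grad(theta)                        # complete gradient g
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq < tol:                    # gradient (nearly) zero: stop
            break
        j_old = cost(theta)
        eps = eps0
        # Shrink eps until the sufficient-decrease (Armijo) condition holds:
        #   J(theta - eps * g) <= J(theta) - c * eps * ||g||^2
        while True:
            trial = [t - eps * gi for t, gi in zip(theta, g)]
            if cost(trial) <= j_old - c * eps * g_norm_sq:
                break
            eps *= shrink
        theta = trial                          # update of Eq. (120)
        steps.append(eps)
    return theta, steps


# Hypothetical quadratic cost with two parameters (illustration only).
def cost(theta):
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)


def grad(theta):
    return [theta[0], 10.0 * theta[1]]


theta0 = [1.0, 1.0]
theta_final, steps = armijo_gradient_descent(cost, grad, theta0)
print(steps[0], cost(theta0), cost(theta_final))
```
]]></code>
<p>In the first iteration of this example, the trial step lengths 1, 0.5, and 0.25 all increase the cost, and the step length 0.125 is the first to be accepted. Since the step length is computed from the norm of the complete gradient and from cost evaluations along the complete gradient direction, a layer-by-layer update as in Eq. (<xref ref-type="disp-formula" rid="eqn-119">119</xref>) would not be possible here.</p>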
<p><disp-quote><p>&#x201C;Neural network researchers have long realized that the learning rate is reliably one of the most difficult to set hyperparameters because it significantly affects model performance.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 298.</p>
</disp-quote></p>
<p>In fact, this difficulty is well known in the field of optimization, where the learning rate is called the step length and is often mnemonically denoted by <inline-formula id="ieqn-819"><mml:math id="mml-ieqn-819"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula>, the Greek letter for &#x201C;l&#x201D;, standing for &#x201C;length&#x201D;; see, e.g., Polak (1971) [<xref ref-type="bibr" rid="ref-139">139</xref>].</p>
<p><disp-quote><p>&#x201C;We can choose <inline-formula id="ieqn-820"><mml:math id="mml-ieqn-820"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> in several different ways. A popular approach is to set <inline-formula id="ieqn-821"><mml:math id="mml-ieqn-821"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate <inline-formula id="ieqn-822"><mml:math id="mml-ieqn-822"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:msub><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for several values of <inline-formula id="ieqn-823"><mml:math id="mml-ieqn-823"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> and choose the one that results in the smallest objective function value. This last strategy is called a <bold>line search</bold>.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 82.</p></disp-quote></p>
<p>Choosing an arbitrarily small <inline-formula id="ieqn-824"><mml:math id="mml-ieqn-824"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula>, without guidance on how small is small, is not a good approach, since exceedingly slow convergence could result for too small <inline-formula id="ieqn-825"><mml:math id="mml-ieqn-825"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula>. In general, it would not be possible to solve for the step size to make the directional derivative vanish. There are several variants of line search for computing the step size to decrease the cost function, based on the &#x201C;decrease&#x201D; conditions,<xref ref-type="fn" rid="fn115"><sup>115</sup></xref><fn id="fn115"><label>115</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-140">140</xref>], [<xref ref-type="bibr" rid="ref-141">141</xref>].</p></fn> among which some are mentioned below.<xref ref-type="fn" rid="fn116"><sup>116</sup></xref><fn id="fn116"><label>116</label><p>See also [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 243.</p></fn></p><statement id="st6_4"><title>Remark 6.4.</title>
<p>Line search in deep-learning training. Line search methods are not only important for use in deterministic optimization with full batch of examples,<xref ref-type="fn" rid="fn117"><sup>117</sup></xref><fn id="fn117"><label>117</label><p>A full batch contains all examples in the training set. There is a confusion in the use of the word &#x201C;batch&#x201D; in terminologies such as &#x201C;batch optimization&#x201D; or &#x201C;batch gradient descent&#x201D;, which are used to mean the full training set, and not a subset of the training set; see, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 271. Hence we explicitly use &#x201C;full batch&#x201D; for full training set, and mini-batch for a small subset of the training set.</p></fn> but also in stochastic optimization (see Section <xref ref-type="sec" rid="s6_3">6.3</xref>) with random mini-batches of examples [<xref ref-type="bibr" rid="ref-80">80</xref>]. The difficulty of using stochastic gradient coming from random mini-batches is the presence of noise or &#x201C;discontinuities&#x201D;<xref ref-type="fn" rid="fn118"><sup>118</sup></xref><fn id="fn118"><label>118</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-80">80</xref>]. Noise is sometimes referred to as &#x201C;discontinuities&#x201D; as in [<xref ref-type="bibr" rid="ref-142">142</xref>]. See also the lecture video &#x201C;Understanding mini-batch gradient descent,&#x201D; at time 1:20, by Andrew Ng on Coursera <ext-link ext-link-type="uri" xlink:href="https://www.coursera.org/lecture/deep-neural-network/understanding-mini-batch-gradient-descent-lBXu8">website</ext-link>.</p></fn> in the cost function and in the gradient. 
Recent stochastic optimization methods&#x2014;such as the sub-sampled Hessian-free Newton method reviewed in [<xref ref-type="bibr" rid="ref-80">80</xref>], the probabilistic line search in [<xref ref-type="bibr" rid="ref-143">143</xref>], the first-order stochastic Armijo line search in [<xref ref-type="bibr" rid="ref-144">144</xref>], the second-order sub-sampling line search method in [<xref ref-type="bibr" rid="ref-145">145</xref>], the quasi-Newton method with probabilistic line search in [<xref ref-type="bibr" rid="ref-146">146</xref>], etc.&#x2014;where line search forms a key subprocedure, are designed to address or circumvent the noisy gradient problem. For this reason, claims that line search methods have &#x201C;fallen out of favor&#x201D;<xref ref-type="fn" rid="fn119"><sup>119</sup></xref><fn id="fn119"><label>119</label><p>In [<xref ref-type="bibr" rid="ref-78">78</xref>], a discussion on line search methods, however brief, was completely bypassed to focus on stochastic gradient-descent methods with learning-rate tuning and scheduling, such as AdaGrad, Adam, etc. Ironically, it is disconcerting to see these authors, who made important contributions to deep learning, thus helping to thaw the last &#x201C;AI winter&#x201D;, regard with skepticism &#x201C;most guidance&#x201D; on learning-rate selection; see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 287, and Section <xref ref-type="sec" rid="s6_3">6.3</xref>. Since then, fully automatic stochastic line-search methods, without tuning runs, have been developed, apparently starting with [<xref ref-type="bibr" rid="ref-147">147</xref>]. 
In the abstract of [<xref ref-type="bibr" rid="ref-142">142</xref>], where an interesting method using only gradients, without function evaluations, was presented, one reads &#x201C;Due to discontinuities induced by mini-batch sampling, [line searches] have largely fallen out of favor&#x201D;.</p></fn> would be misleading, as they may encourage students not to learn the classics. A classic never dies; it just re-emerges in a different form with additional developments to tackle new problems.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>In view of <xref ref-type="statement" rid="st6_4">Remark 6.4</xref>, a goal of this section is to give readers unfamiliar with classical deterministic line search methods a feel for these concepts, in preparation for reading about the extensions of these methods to stochastic line search.</p>
<sec id="s6_2_1"><label>6.2.1</label>
<title>Exact line search</title>
<p>Find a positive step length <inline-formula id="ieqn-826"><mml:math id="mml-ieqn-826"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> that minimizes the cost function <inline-formula id="ieqn-827"><mml:math id="mml-ieqn-827"><mml:mi>J</mml:mi></mml:math></inline-formula> along the descent direction <inline-formula id="ieqn-828"><mml:math id="mml-ieqn-828"><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:math></inline-formula> such that the scalar (dot) product between <inline-formula id="ieqn-829"><mml:math id="mml-ieqn-829"><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-830"><mml:math id="mml-ieqn-830"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-121"><label>(121)</label><mml:math id="mml-eqn-121" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>&#x003C;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x003E;=</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>is negative, i.e., the descent direction <inline-formula id="ieqn-831"><mml:math id="mml-ieqn-831"><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:math></inline-formula> and the gradient <inline-formula id="ieqn-832"><mml:math id="mml-ieqn-832"><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:math></inline-formula> form an obtuse angle bounded away from <inline-formula id="ieqn-833"><mml:math id="mml-ieqn-833"><mml:mrow><mml:msup><mml:mn>90</mml:mn><mml:mo>&#x00B0;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula>,<xref ref-type="fn" rid="fn120"><sup>120</sup></xref><fn id="fn120"><label>120</label><p>Or equivalently, the descent direction <inline-formula id="ieqn-3126"><mml:math id="mml-ieqn-3126"><mml:mrow><mml:mi mathvariant='bold-italic'>d</mml:mi></mml:mrow></mml:math></inline-formula> forms an acute angle with the gradient (or steepest) descent direction <inline-formula id="ieqn-3127"><mml:math id="mml-ieqn-3127"><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, i.e., the negative of the gradient direction.</p></fn> and</p>
<p><disp-formula id="eqn-122"><label>(122)</label><mml:math id="mml-eqn-122" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mi>&#x03BB;</mml:mi></mml:munder><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
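<p>As an illustration only, the one-dimensional minimization in Eq. (<xref ref-type="disp-formula" rid="eqn-122">122</xref>) can be sketched in Python with a golden-section search on a finite bracket. The function names <monospace>golden_section_search</monospace> and <monospace>exact_line_search</monospace>, the bracket bound <monospace>eps_max</monospace>, the tolerance <monospace>tol</monospace>, and the quadratic test cost <monospace>J_quad</monospace> are assumptions made for this sketch and are not taken from the cited references; the sketch further assumes that the cost is unimodal along the descent direction within the bracket.</p>
<code language="python"><![CDATA[
```python
import math


def golden_section_search(phi, a, b, tol=1e-6):
    """Approximate the minimizer of a unimodal scalar function phi on [a, b]."""
    inv_gr = (math.sqrt(5.0) - 1.0) / 2.0  # reciprocal of the golden ratio
    c = b - inv_gr * (b - a)
    d = a + inv_gr * (b - a)
    fc, fd = phi(c), phi(d)
    while (b - a) > tol:
        if fc < fd:
            # Minimizer lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_gr * (b - a)
            fc = phi(c)
        else:
            # Minimizer lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_gr * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)


def exact_line_search(J, theta, d, eps_max=1.0, tol=1e-6):
    """Step length minimizing J(theta + lam * d) over lam in [0, eps_max]."""
    def phi(lam):
        return J([t + lam * di for t, di in zip(theta, d)])
    return golden_section_search(phi, 0.0, eps_max, tol)


def J_quad(theta):
    """Hypothetical convex test cost, used only for illustration."""
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 2.0) ** 2


theta0 = [0.0, 0.0]
g0 = [-2.0, 8.0]          # gradient of J_quad at theta0
d0 = [-gi for gi in g0]   # steepest-descent direction, so g0 . d0 < 0
eps_star = exact_line_search(J_quad, theta0, d0)
print(eps_star)
```
]]></code>
<p>For this test cost, the restriction to the search line is the parabola 132&#x03BB;<sup>2</sup> &#x2212; 68&#x03BB; + 9, whose minimizer is 17/66, so the printed step length should agree with that value to within the chosen tolerance. Each iteration reuses one of the two interior function values, so only one new cost evaluation is needed per reduction of the bracket.</p>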
<p>The minimization problem in Eq. (<xref ref-type="disp-formula" rid="eqn-122">122</xref>) can be implemented using the Golden section search (or infinite Fibonacci search) for unimodal functions.<xref ref-type="fn" rid="fn121"><sup>121</sup></xref><fn id="fn121"><label>121</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 31, for the implementable algorithm, with the assumption that the cost function was convex. Convexity is, however, not needed for this algorithm to work; only unimodality is needed. A unimodal function has a unique minimum, and is decreasing on the left of the minimum, and increasing on the right of the minimum. Convex functions are necessarily unimodal, but not vice versa. Convexity is a particular case of unimodality. See also [<xref ref-type="bibr" rid="ref-148">148</xref>], p. 216, on Golden section search as infinite Fibonacci search and curve fitting line-search methods.</p></fn> For more general non-convex cost functions, a minimizing step length may be non-existent, or difficult to compute exactly.<xref ref-type="fn" rid="fn122"><sup>122</sup></xref><fn id="fn122"><label>122</label><p>See [<xref ref-type="bibr" rid="ref-149">149</xref>], p. 29, for this assertion, without examples. 
An example of a non-existent minimizing step length <inline-formula id="ieqn-3128"><mml:math id="mml-ieqn-3128"><mml:mi>&#x03F5;</mml:mi><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mi>&#x03BB;</mml:mi></mml:munder><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> could arise when <inline-formula id="ieqn-3129"><mml:math id="mml-ieqn-3129"><mml:mi>f</mml:mi></mml:math></inline-formula> is a concave function. If we relax the continuity requirement, then it is easy to construct a function with no minimum and no maximum, e.g., <inline-formula id="ieqn-3130"><mml:math id="mml-ieqn-3130"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> on open intervals <inline-formula id="ieqn-3131"><mml:math id="mml-ieqn-3131"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x222A;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and <inline-formula id="ieqn-3132"><mml:math id="mml-ieqn-3132"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>; this function is &#x201C;essentially&#x201D; convex, except on the set <inline-formula id="ieqn-3133"><mml:math id="mml-ieqn-3133"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> of measure zero where it is discontinuous. An example of a function whose minimum is difficult to compute exactly could be one with a minimum at the bottom of an extremely narrow crack.</p></fn> In addition, a line search for a minimizing step length is only an auxiliary step in an overall optimization algorithm. It is therefore sufficient to find an approximate step length satisfying some decrease conditions to ensure convergence to a local minimum, while keeping the step length from being so small that it would hinder a reasonable advance toward such a local minimum. For these reasons, inexact line search methods (rules) were introduced, first in [<xref ref-type="bibr" rid="ref-150">150</xref>], followed by [<xref ref-type="bibr" rid="ref-151">151</xref>], then [<xref ref-type="bibr" rid="ref-152">152</xref>] and [<xref ref-type="bibr" rid="ref-153">153</xref>]. In view of <xref ref-type="statement" rid="st6_4">Remark 6.4</xref> and Footnote <xref ref-type="fn" rid="fn119">119</xref>, as we present these deterministic line-search rules, we will also immediately recall, where applicable, the recent references that generalize these rules by adding stochasticity for use as a subprocedure (inner loop) for the stochastic gradient-descent (SGD) algorithm.</p></sec>
<sec id="s6_2_2"><label>6.2.2</label>
<title>Inexact line-search, Goldstein&#x2019;s rule</title>
<p>The method is inexact since the search for an acceptable step length would stop before a minimum is reached, once the rule is satisfied.<xref ref-type="fn" rid="fn123"><sup>123</sup></xref><fn id="fn123"><label>123</label><p>The book [<xref ref-type="bibr" rid="ref-154">154</xref>] was cited in [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 33, but not the papers [<xref ref-type="bibr" rid="ref-150">150</xref>] and [<xref ref-type="bibr" rid="ref-154">154</xref>]b, where Goldstein&#x2019;s rule was explicitly presented in the form: Step length <inline-formula id="ieqn-3134"><mml:math id="mml-ieqn-3134"><mml:mo>&#x03B3;</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mi>&#x03C6;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>&#x03C6;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo 
stretchy="false">]</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, with &#x03C6; being the descent direction, <inline-formula id="ieqn-3136"><mml:math id="mml-ieqn-3136"><mml:mo stretchy="false">[</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> the scalar (dot) product between vector <inline-formula id="ieqn-3137"><mml:math id="mml-ieqn-3137"><mml:mi>a</mml:mi></mml:math></inline-formula> and vector <inline-formula id="ieqn-3138"><mml:math id="mml-ieqn-3138"><mml:mi>b</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-3139"><mml:math id="mml-ieqn-3139"><mml:mo>&#x03B4;</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula>, if <inline-formula id="ieqn-3140"><mml:math id="mml-ieqn-3140"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula>. 
The same relation was given in [<xref ref-type="bibr" rid="ref-150">150</xref>], with different notation.</p></fn> For a fixed constant <inline-formula id="ieqn-834"><mml:math id="mml-ieqn-834"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="true">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo stretchy="true">)</mml:mo></mml:math></inline-formula>, select a learning rate (step length) <inline-formula id="ieqn-835"><mml:math id="mml-ieqn-835"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> such that<xref ref-type="fn" rid="fn124"><sup>124</sup></xref><fn id="fn124"><label>124</label><p>See also [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 33.</p></fn></p>
<p><disp-formula id="eqn-123"><label>(123)</label><mml:math id="mml-eqn-123" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mfrac><mml:mrow><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x2264;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where both the numerator and the denominator are negative, i.e., <inline-formula id="ieqn-836"><mml:math id="mml-ieqn-836"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-837"><mml:math id="mml-ieqn-837"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> by Eq. (<xref ref-type="disp-formula" rid="eqn-121">121</xref>). Eq. (<xref ref-type="disp-formula" rid="eqn-123">123</xref>) can be recast into a slightly more general form: For <inline-formula id="ieqn-838"><mml:math id="mml-ieqn-838"><mml:mn>0</mml:mn><mml:mo>&#x003C;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, choose a learning rate <inline-formula id="ieqn-839"><mml:math id="mml-ieqn-839"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> such that<xref ref-type="fn" rid="fn125"><sup>125</sup></xref><fn id="fn125"><label>125</label><p>See [<xref ref-type="bibr" rid="ref-149">149</xref>], p. 55, and [<xref ref-type="bibr" rid="ref-156">156</xref>], p. 
256, where the equality <inline-formula id="ieqn-3141"><mml:math id="mml-ieqn-3141"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> was even allowed, provided the step length <inline-formula id="ieqn-3142"><mml:math id="mml-ieqn-3142"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> satisfied the equality <inline-formula id="ieqn-3143"><mml:math id="mml-ieqn-3143"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula>. But that would make the computation unnecessarily stringent and costly, since the step length <inline-formula id="ieqn-3144"><mml:math id="mml-ieqn-3144"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> as the root of this equation has to be solved for accurately. 
Again, the idea should be to make the sector bounded by the two lines, <inline-formula id="ieqn-3145"><mml:math id="mml-ieqn-3145"><mml:mi>&#x03B2;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula> from below and the line <inline-formula id="ieqn-3146"><mml:math id="mml-ieqn-3146"><mml:mi>&#x03B1;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula> from above, as large as possible in inexact line search. See discussion below Eq. (<xref ref-type="disp-formula" rid="eqn-124">124</xref>).</p></fn></p>
<p>
<disp-formula id="eqn-124"><label>(124)</label><mml:math id="mml-eqn-124" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03B2;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>:=</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo 
stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>A reason for this generalization could be that the sector bounded by the two lines <inline-formula id="ieqn-840"><mml:math id="mml-ieqn-840"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-841"><mml:math id="mml-ieqn-841"><mml:mi>&#x03B1;</mml:mi><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula> may be too narrow when <inline-formula id="ieqn-842"><mml:math id="mml-ieqn-842"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> is close to 0.5 from below, making <inline-formula id="ieqn-843"><mml:math id="mml-ieqn-843"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> also close to 0.5 from above. For example, it was recommended in [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 33 and p. 37, to use <inline-formula id="ieqn-844"><mml:math id="mml-ieqn-844"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.4</mml:mn></mml:math></inline-formula>, and hence <inline-formula id="ieqn-845"><mml:math id="mml-ieqn-845"><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.6</mml:mn></mml:math></inline-formula>, making a tight sector, but we could enlarge such a sector by choosing <inline-formula id="ieqn-846"><mml:math id="mml-ieqn-846"><mml:mn>0.6</mml:mn><mml:mo>&#x003C;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>.</p>
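<p>As a minimal sketch, the two-sided acceptance test in Eq. (<xref ref-type="disp-formula" rid="eqn-124">124</xref>) can be combined with a simple bracketing-and-bisection loop to find an acceptable step length. The function name <monospace>goldstein_step</monospace>, the initial trial step <monospace>eps0</monospace>, the doubling strategy used while no upper bracket is known, the default values of <monospace>alpha</monospace> and <monospace>beta</monospace>, and the quadratic test cost <monospace>J_quad</monospace> are assumptions chosen for illustration; they are not prescribed by the cited references.</p>
<code language="python"><![CDATA[
```python
import math


def goldstein_step(J, theta, d, g, alpha=0.4, beta=0.8, eps0=1.0, max_iter=100):
    """Find eps with beta*eps*(g.d) <= J(theta + eps*d) - J(theta) <= alpha*eps*(g.d)."""
    if not (0.0 < alpha < beta < 1.0):
        raise ValueError("need 0 < alpha < beta < 1")
    gd = sum(gi * di for gi, di in zip(g, d))
    if gd >= 0.0:
        raise ValueError("d is not a descent direction: g . d must be negative")
    J0 = J(theta)
    lo, hi = 0.0, math.inf   # current bracket for the step length
    eps = eps0
    for _ in range(max_iter):
        dJ = J([t + eps * di for t, di in zip(theta, d)]) - J0
        if dJ > alpha * eps * gd:
            hi = eps         # above the upper-bound line: step too long
        elif dJ < beta * eps * gd:
            lo = eps         # below the lower-bound line: step too short
        else:
            return eps       # inside the acceptable sector
        # Double the step until an upper bracket exists, then bisect.
        eps = 2.0 * lo if math.isinf(hi) else 0.5 * (lo + hi)
    raise RuntimeError("no acceptable step length found within max_iter trials")


def J_quad(theta):
    """Hypothetical convex test cost, used only for illustration."""
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 2.0) ** 2


theta0 = [0.0, 0.0]
g0 = [-2.0, 8.0]          # gradient of J_quad at theta0
d0 = [-gi for gi in g0]   # steepest-descent direction
eps_ok = goldstein_step(J_quad, theta0, d0, g0)
print(eps_ok)
```
]]></code>
<p>The rule itself only specifies the acceptance test; the bracketing loop is one possible subprocedure among several. Setting <monospace>beta</monospace> equal to one minus <monospace>alpha</monospace> recovers the symmetric form in Eq. (<xref ref-type="disp-formula" rid="eqn-123">123</xref>).</p>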
<fig id="fig-62">
<label>Figure 62</label>
<caption><title><italic>Inexact line search, Goldstein&#x2019;s rule</italic> (Section <xref ref-type="sec" rid="s6_2_2">6.2.2</xref>). Acceptable step lengths would be such that a decrease in the cost function <inline-formula id="ieqn-673"><mml:math id="mml-ieqn-673"><mml:mi>J</mml:mi></mml:math></inline-formula>, denoted by <inline-formula id="ieqn-674"><mml:math id="mml-ieqn-674"><mml:mo>&#x0394;</mml:mo><mml:mi>J</mml:mi></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-124">124</xref>), falls into an acceptable sector formed by an upper-bound line and a lower-bound line. The upper bound is given by the straight line <inline-formula id="ieqn-675"><mml:math id="mml-ieqn-675"><mml:mi>&#x03B1;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula> (green), with fixed constant <inline-formula id="ieqn-676"><mml:math id="mml-ieqn-676"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-677"><mml:math id="mml-ieqn-677"><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> being the slope to the curve <inline-formula id="ieqn-678"><mml:math id="mml-ieqn-678"><mml:mo>&#x0394;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> at <inline-formula id="ieqn-679"><mml:math id="mml-ieqn-679"><mml:mi>&#x03F5;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>. The lower-bound line <inline-formula id="ieqn-680"><mml:math id="mml-ieqn-680"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula> (black), adopted in [<xref ref-type="bibr" rid="ref-150">150</xref>] and [<xref ref-type="bibr" rid="ref-155">155</xref>], would make the sector too narrow when <inline-formula id="ieqn-681"><mml:math id="mml-ieqn-681"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> is close to <inline-formula id="ieqn-682"><mml:math id="mml-ieqn-682"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula>, leaving all local minimizers such as <inline-formula id="ieqn-683"><mml:math id="mml-ieqn-683"><mml:msubsup><mml:mi>&#x03F5;</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-684"><mml:math id="mml-ieqn-684"><mml:msubsup><mml:mi>&#x03F5;</mml:mi><mml:mn>2</mml:mn><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula> outside of the acceptable intervals <inline-formula id="ieqn-685"><mml:math id="mml-ieqn-685"><mml:msubsup><mml:mi>I</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo 
stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula id="ieqn-686"><mml:math id="mml-ieqn-686"><mml:msubsup><mml:mi>I</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, and <inline-formula id="ieqn-687"><mml:math id="mml-ieqn-687"><mml:msubsup><mml:mi>I</mml:mi><mml:mn>3</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> (black), which are themselves narrow. The lower-bound line <inline-formula id="ieqn-688"><mml:math id="mml-ieqn-688"><mml:mi>&#x03B2;</mml:mi><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula> (purple) proposed in [<xref ref-type="bibr" rid="ref-156">156</xref>], p. 256, and [<xref ref-type="bibr" rid="ref-149">149</xref>], p. 
55, with <inline-formula id="ieqn-689"><mml:math id="mml-ieqn-689"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x003C;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, would enlarge the acceptable sector, which then may contain the minimizers inside the corresponding acceptable intervals <inline-formula id="ieqn-690"><mml:math id="mml-ieqn-690"><mml:msubsup><mml:mi>I</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-691"><mml:math id="mml-ieqn-691"><mml:msubsup><mml:mi>I</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> (purple).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-62.tif"/>
</fig>
<p>The search for an appropriate step length that satisfies Eq. (<xref ref-type="disp-formula" rid="eqn-123">123</xref>) or Eq. (<xref ref-type="disp-formula" rid="eqn-124">124</xref>) could be carried out by a subprocedure based on, e.g., the bisection method, as suggested in [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 33. Goldstein&#x2019;s rule (also designated as <italic>Goldstein principle</italic> in the classic book [<xref ref-type="bibr" rid="ref-156">156</xref>], p. 256, since it ensured a decrease in the cost function) has been &#x201C;used only occasionally&#x201D; per Polak (1997) [<xref ref-type="bibr" rid="ref-149">149</xref>], p. 55, has been largely superseded by Armijo&#x2019;s rule, and has not been generalized to add stochasticity. On the other hand, the idea behind Armijo&#x2019;s rule is similar to Goldstein&#x2019;s rule, but with a convenient subprocedure<xref ref-type="fn" rid="fn126"><sup>126</sup></xref><fn id="fn126"><label>126</label><p>See [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 36, Algorithm 36.</p></fn> to find the appropriate step length.</p></sec>
<sec id="s6_2_3"><label>6.2.3</label>
<title>Inexact line-search, Armijo&#x2019;s rule</title>
<p>Apparently without knowledge of [<xref ref-type="bibr" rid="ref-150">150</xref>], the author of [<xref ref-type="bibr" rid="ref-151">151</xref>] proposed the following highly popular Armijo step-length search,<xref ref-type="fn" rid="fn127"><sup>127</sup></xref><fn id="fn127"><label>127</label><p>As of 2022.07.09, [<xref ref-type="bibr" rid="ref-151">151</xref>] was cited 2301 times in various publications (books, papers) according to Google Scholar, and 1028 times in archival journal papers according to Web of Science. There are references that mention the name Armijo without referring to the original paper [<xref ref-type="bibr" rid="ref-151">151</xref>], such as [<xref ref-type="bibr" rid="ref-157">157</xref>], clearly indicating that Armijo&#x2019;s rule is a classic, just as there is no need to refer to Newton&#x2019;s original work for Newton&#x2019;s method.</p></fn> which has recently formed the basis for stochastic line search in the stochastic gradient-descent algorithms described in Section <xref ref-type="sec" rid="s6_3">6.3</xref>: Stochasticity was added to Armijo&#x2019;s rule in [<xref ref-type="bibr" rid="ref-144">144</xref>], and the concept was extended to second-order line search in [<xref ref-type="bibr" rid="ref-145">145</xref>]. Line search based on Armijo&#x2019;s rule has also been applied to the quasi-Newton method for noisy functions in [<xref ref-type="bibr" rid="ref-158">158</xref>], and to exact and inexact subsampled Newton methods in [<xref ref-type="bibr" rid="ref-159">159</xref>].<xref ref-type="fn" rid="fn128"><sup>128</sup></xref><fn id="fn128"><label>128</label><p>All of these stochastic optimization methods are considered as part of a broader class known as derivative-free optimization methods [<xref ref-type="bibr" rid="ref-160">160</xref>].</p></fn></p>
<p>Armijo&#x2019;s rule is stated as follows: For <inline-formula id="ieqn-847"><mml:math id="mml-ieqn-847"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-848"><mml:math id="mml-ieqn-848"><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and <inline-formula id="ieqn-849"><mml:math id="mml-ieqn-849"><mml:mi>&#x03C1;</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, use the step length <inline-formula id="ieqn-850"><mml:math id="mml-ieqn-850"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> such that:<xref ref-type="fn" rid="fn129"><sup>129</sup></xref><fn id="fn129"><label>129</label><p>[<xref ref-type="bibr" rid="ref-156">156</xref>], p. 491, called the constructive technique in Eq. (<xref ref-type="disp-formula" rid="eqn-125">125</xref>) to obtain the step length <inline-formula id="ieqn-3147"><mml:math id="mml-ieqn-3147"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> the Goldstein-Armijo algorithm, since [<xref ref-type="bibr" rid="ref-150">150</xref>] and [<xref ref-type="bibr" rid="ref-154">154</xref>] did not propose a method to solve for the step length, while [<xref ref-type="bibr" rid="ref-151">151</xref>] did. See also the text below Eq. (<xref ref-type="disp-formula" rid="eqn-124">124</xref>), where it was mentioned that a bisection method can be used with Goldstein&#x2019;s rule.</p></fn></p>
<p><disp-formula id="eqn-125"><label>(125)</label><mml:math id="mml-eqn-125" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>a</mml:mi></mml:msup><mml:mspace width="thinmathspace" /><mml:mi>&#x03C1;</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>a</mml:mi></mml:msup><mml:mspace width="thinmathspace" /><mml:mi>&#x03C1;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>a</mml:mi></mml:msup></mml:mrow><mml:mi>&#x03C1;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the decrease in the cost function along the descent direction <inline-formula id="ieqn-851"><mml:math id="mml-ieqn-851"><mml:mrow><mml:mi mathvariant="bold-italic">d</mml:mi></mml:mrow></mml:math></inline-formula>, denoted by <inline-formula id="ieqn-852"><mml:math id="mml-ieqn-852"><mml:mtext>&#x0394;</mml:mtext><mml:mi>J</mml:mi></mml:math></inline-formula>, was defined in Eq. (<xref ref-type="disp-formula" rid="eqn-124">124</xref>), and the descent direction <inline-formula id="ieqn-853"><mml:math id="mml-ieqn-853"><mml:mrow><mml:mi mathvariant="bold-italic">d</mml:mi></mml:mrow></mml:math></inline-formula> is related to the gradient <inline-formula id="ieqn-854"><mml:math id="mml-ieqn-854"><mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow></mml:math></inline-formula> via Eq. (<xref ref-type="disp-formula" rid="eqn-121">121</xref>). The Armijo condition in Eq. (<xref ref-type="disp-formula" rid="eqn-125">125</xref>) can be rewritten as</p>
<p><disp-formula id="eqn-126"><label>(126)</label><mml:math id="mml-eqn-126" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is also known as the Armijo sufficient decrease condition, the first of the two Wolfe conditions presented below; see [<xref ref-type="bibr" rid="ref-152">152</xref>], [<xref ref-type="bibr" rid="ref-149">149</xref>], p. 55.<xref ref-type="fn" rid="fn130"><sup>130</sup></xref><fn id="fn130"><label>130</label><p>See also [<xref ref-type="bibr" rid="ref-157">157</xref>], p. 34, [<xref ref-type="bibr" rid="ref-148">148</xref>], p. 230.</p></fn></p>
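<p>As a minimal sketch of how the step-length search in Eq. (<xref ref-type="disp-formula" rid="eqn-125">125</xref>) might be implemented, the Python function below tries the exponents a = 0, 1, 2, ... in turn and accepts the first trial step &#x03B2;<sup>a</sup>&#x03C1; that satisfies the sufficient-decrease condition in Eq. (<xref ref-type="disp-formula" rid="eqn-126">126</xref>). This backtracking reading of the minimum in Eq. (<xref ref-type="disp-formula" rid="eqn-125">125</xref>) is an assumption made here, and the names <monospace>armijo_step</monospace>, <monospace>dot</monospace>, <monospace>cost</monospace>, and <monospace>max_reductions</monospace>, as well as the quadratic test function, are hypothetical and not taken from the cited references.</p>
<p><code language="python"><![CDATA[
```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def armijo_step(J, theta, g, d, alpha=0.5, beta=0.5, rho=1.0, max_reductions=50):
    """First trial step beta**a * rho, a = 0, 1, 2, ..., that satisfies Eq. (126)."""
    slope = dot(g, d)  # g . d, negative when d is a descent direction
    if slope >= 0.0:
        raise ValueError("d is not a descent direction: g . d must be negative")
    J0 = J(theta)
    eps = rho  # exponent a = 0
    for _ in range(max_reductions):
        trial = [t + eps * di for t, di in zip(theta, d)]
        if J(trial) <= J0 + alpha * eps * slope:  # sufficient decrease
            return eps
        eps *= beta  # next exponent: shrink the step geometrically
    raise RuntimeError("no acceptable step length found")


def cost(theta):
    """Hypothetical test function: an anisotropic quadratic bowl."""
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)


theta0 = [1.0, 1.0]
g0 = [1.0, 10.0]         # gradient of cost at theta0
d0 = [-gi for gi in g0]  # steepest-descent direction, d = -g
eps = armijo_step(cost, theta0, g0, d0)
theta1 = [t + eps * di for t, di in zip(theta0, d0)]  # updated iterate
```
]]></code></p>
<p>For this test function the first few trial steps overshoot along the steep coordinate and are rejected, so the accepted step is smaller than &#x03C1;. The cap on the number of reductions is a safeguard added for the sketch only.</p>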
<p>Regarding the parameters <inline-formula id="ieqn-855"><mml:math id="mml-ieqn-855"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-856"><mml:math id="mml-ieqn-856"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-857"><mml:math id="mml-ieqn-857"><mml:mi>&#x03C1;</mml:mi></mml:math></inline-formula> in Armijo&#x2019;s rule, Eq. (<xref ref-type="disp-formula" rid="eqn-125">125</xref>), [<xref ref-type="bibr" rid="ref-151">151</xref>] chose to fix</p>
<p><disp-formula id="eqn-127"><label>(127)</label><mml:math id="mml-eqn-127" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03C1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and proved a convergence theorem. In practice, <inline-formula id="ieqn-858"><mml:math id="mml-ieqn-858"><mml:mi>&#x03C1;</mml:mi></mml:math></inline-formula> cannot be arbitrarily large. Polak (1971) [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 36, also fixed <inline-formula id="ieqn-859"><mml:math id="mml-ieqn-859"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula>, but recommended selecting <inline-formula id="ieqn-860"><mml:math id="mml-ieqn-860"><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0.5</mml:mn><mml:mo>,</mml:mo><mml:mn>0.8</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, based on numerical experiments,<xref ref-type="fn" rid="fn131"><sup>131</sup></xref><fn id="fn131"><label>131</label><p>See [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 301.</p></fn> and selecting<xref ref-type="fn" rid="fn132"><sup>132</sup></xref><fn id="fn132"><label>132</label><p>To satisfy the condition in Eq. 
(<xref ref-type="disp-formula" rid="eqn-121">121</xref>), the descent direction <inline-formula id="ieqn-3148"><mml:math id="mml-ieqn-3148"><mml:mrow><mml:mi mathvariant='bold-italic'>d</mml:mi></mml:mrow></mml:math></inline-formula> is required to satisfy <inline-formula id="ieqn-3149"><mml:math id="mml-ieqn-3149"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo stretchy="false">&#x2225;</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-3150"><mml:math id="mml-ieqn-3150"><mml:mi>&#x03C1;</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, so as to form an obtuse angle, bounded away from <inline-formula id="ieqn-3151"><mml:math id="mml-ieqn-3151"><mml:msup><mml:mn>90</mml:mn><mml:mo>&#x2218;</mml:mo></mml:msup></mml:math></inline-formula>, with the gradient <inline-formula id="ieqn-3152"><mml:math id="mml-ieqn-3152"><mml:mrow><mml:mi mathvariant='bold-italic'>g</mml:mi></mml:mrow></mml:math></inline-formula>. 
But <inline-formula id="ieqn-3153"><mml:math id="mml-ieqn-3153"><mml:mi>&#x03C1;</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> would violate the Schwarz inequality, which requires <inline-formula id="ieqn-3154"><mml:math id="mml-ieqn-3154"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo stretchy="false">&#x2225;</mml:mo></mml:math></inline-formula>.</p></fn> <inline-formula id="ieqn-861"><mml:math id="mml-ieqn-861"><mml:mi>&#x03C1;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> to minimize the rate <inline-formula id="ieqn-862"><mml:math id="mml-ieqn-862"><mml:mi>r</mml:mi></mml:math></inline-formula> of geometric progression (from the iterate <inline-formula id="ieqn-863"><mml:math id="mml-ieqn-863"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, for <inline-formula id="ieqn-864"><mml:math id="mml-ieqn-864"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo></mml:math></inline-formula>, toward the local minimizer <inline-formula id="ieqn-865"><mml:math id="mml-ieqn-865"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula>) for linear convergence:<xref ref-type="fn" rid="fn133"><sup>133</sup></xref><fn 
id="fn133"><label>133</label><p>The inequality in Eq. (<xref ref-type="disp-formula" rid="eqn-128">128</xref>) leads to linear convergence in the sense that <inline-formula id="ieqn-3155"><mml:math id="mml-ieqn-3155"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula>, for <inline-formula id="ieqn-3156"><mml:math id="mml-ieqn-3156"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-3157"><mml:math id="mml-ieqn-3157"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo 
stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> being a constant. See [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 245.</p></fn></p>
<p><disp-formula id="eqn-128"><label>(128)</label><mml:math id="mml-eqn-128" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:msup><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03C1;</mml:mi><mml:mi>m</mml:mi></mml:mrow><mml:mi>M</mml:mi></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-866"><mml:math id="mml-ieqn-866"><mml:mi>m</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-867"><mml:math id="mml-ieqn-867"><mml:mi>M</mml:mi></mml:math></inline-formula> are the lower and upper bounds of the eigenvalues<xref ref-type="fn" rid="fn134"><sup>134</sup></xref><fn id="fn134"><label>134</label><p>A narrow valley with the minimizer <inline-formula id="ieqn-3158"><mml:math id="mml-ieqn-3158"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>&#x03B8;</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula> at the bottom would have a very small ratio <inline-formula id="ieqn-3159"><mml:math id="mml-ieqn-3159"><mml:mi>m</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>M</mml:mi></mml:math></inline-formula>. See also the use of &#x201C;small heavy sphere&#x201D; (also known as &#x201C;heavy ball&#x201D;) method to accelerate convergence in the case of narrow valley in Section <xref ref-type="sec" rid="s6_3_2">6.3.2</xref> on stochastic gradient descent with momentum.</p></fn> of the Hessian <inline-formula id="ieqn-868"><mml:math id="mml-ieqn-868"><mml:msup><mml:mo>&#x2207;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mi>J</mml:mi></mml:math></inline-formula>, thus <inline-formula id="ieqn-869"><mml:math id="mml-ieqn-869"><mml:mrow><mml:mfrac><mml:mi>m</mml:mi><mml:mi>M</mml:mi></mml:mfrac></mml:mrow><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. In summary, [<xref ref-type="bibr" rid="ref-139">139</xref>] recommended:</p>
<p><disp-formula id="eqn-129"><label>(129)</label><mml:math id="mml-eqn-129" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0.5</mml:mn><mml:mo>,</mml:mo><mml:mn>0.8</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03C1;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
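<p>To illustrate how strongly the rate in Eq. (<xref ref-type="disp-formula" rid="eqn-128">128</xref>) depends on the eigenvalue ratio m/M, the short Python sketch below evaluates r and the smallest exponent i for which the factor r<sup>i</sup> drops below a given tolerance. The function names <monospace>linear_rate</monospace> and <monospace>iterations_for</monospace>, the tolerance, and the two eigenvalue pairs are hypothetical choices made for illustration only.</p>
<p><code language="python"><![CDATA[
```python
import math


def linear_rate(m, M, rho=1.0):
    """Rate r = 1 - (rho * m / M)**2 from Eq. (128)."""
    return 1.0 - (rho * m / M) ** 2


def iterations_for(tol, r):
    """Smallest integer i such that r**i <= tol, for 0 < r < 1 and 0 < tol < 1."""
    if not (0.0 < r < 1.0 and 0.0 < tol < 1.0):
        raise ValueError("need 0 < r < 1 and 0 < tol < 1")
    return math.ceil(math.log(tol) / math.log(r))


# A well-rounded bowl (m/M = 1/2) versus a narrow valley (m/M = 1/100).
r_round = linear_rate(m=1.0, M=2.0)
r_narrow = linear_rate(m=1.0, M=100.0)

n_round = iterations_for(1e-6, r_round)
n_narrow = iterations_for(1e-6, r_narrow)
```
]]></code></p>
<p>Since Eq. (<xref ref-type="disp-formula" rid="eqn-128">128</xref>) is only an upper bound, these counts are worst-case estimates. They nonetheless show that a small ratio m/M pushes r close to one and inflates the iteration count by several orders of magnitude.</p>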
<p>The pseudocode for deterministic gradient descent with Armijo line search is Algorithm <xref ref-type="fig" rid="fig-160">2</xref>, and the pseudocode for deterministic quasi-Newton / Newton with Armijo line search is Algorithm <xref ref-type="fig" rid="fig-161">3</xref>. When the Hessian <inline-formula id="ieqn-870"><mml:math id="mml-ieqn-870"><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula> is positive definite, the Newton descent direction is:</p>
<p><disp-formula id="eqn-130"><label>(130)</label><mml:math id="mml-eqn-130" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-160">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-160.tif"/>
</fig>
<fig id="fig-161">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-161.tif"/>
</fig>
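<p>As a hedged sketch of how the Newton direction in Eq. (<xref ref-type="disp-formula" rid="eqn-130">130</xref>) might be computed, the Python code below attempts a Cholesky factorization of the Hessian, which succeeds only when the Hessian is positive definite. If the factorization fails, the sketch falls back either to the gradient-descent direction of Eq. (<xref ref-type="disp-formula" rid="eqn-131">131</xref>) or, when a perturbation &#x03B4; is supplied, to the regularized direction of Eq. (<xref ref-type="disp-formula" rid="eqn-132">132</xref>), both discussed next. The use of Cholesky factorization as the definiteness test and the names <monospace>cholesky</monospace>, <monospace>solve_cholesky</monospace>, and <monospace>newton_direction</monospace> are assumptions of this sketch, not the content of Algorithm 3.</p>
<p><code language="python"><![CDATA[
```python
import math


def cholesky(A):
    """Lower-triangular L with A = L L^T; raises ValueError if A is not positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward, then backward, substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def newton_direction(H, g, delta=None):
    """Eq. (130) if H is positive definite; else Eq. (131), or Eq. (132) when delta is given."""
    neg_g = [-gi for gi in g]
    try:
        return solve_cholesky(cholesky(H), neg_g)  # d = -H^{-1} g
    except ValueError:
        if delta is None:
            return neg_g  # gradient-descent fallback, d = -g
        n = len(H)
        H_reg = [[H[i][j] + (delta if i == j else 0.0) for j in range(n)] for i in range(n)]
        # Raises ValueError again if H + delta I is still not positive definite.
        return solve_cholesky(cholesky(H_reg), neg_g)


H_bowl = [[2.0, 0.0], [0.0, 4.0]]     # positive definite
H_saddle = [[1.0, 0.0], [0.0, -1.0]]  # indefinite, as near a saddle point
d_newton = newton_direction(H_bowl, [2.0, 4.0])
d_fallback = newton_direction(H_saddle, [1.0, 1.0])
d_regular = newton_direction(H_saddle, [1.0, 1.0], delta=2.0)
```
]]></code></p>
<p>In the saddle example the perturbation must exceed the magnitude of the negative eigenvalue for the shifted Hessian to become positive definite; the sketch does not choose &#x03B4; automatically.</p>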
<p>When the Hessian <inline-formula id="ieqn-871"><mml:math id="mml-ieqn-871"><mml:mrow><mml:mi mathvariant="bold-italic">H</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is not positive definite, e.g., near a saddle point, the quasi-Newton method uses the gradient-descent direction <inline-formula id="ieqn-872"><mml:math id="mml-ieqn-872"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as the descent direction <inline-formula id="ieqn-873"><mml:math id="mml-ieqn-873"><mml:mrow><mml:mi mathvariant="bold-italic">d</mml:mi></mml:mrow></mml:math></inline-formula>, as in Eq. (<xref ref-type="disp-formula" rid="eqn-120">120</xref>),</p>
<p><disp-formula id="eqn-131"><label>(131)</label><mml:math id="mml-eqn-131" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and the regularized Newton method uses a descent direction based on a regularized Hessian of the form:</p>
<p><disp-formula id="eqn-132"><label>(132)</label><mml:math id="mml-eqn-132" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-874"><mml:math id="mml-ieqn-874"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> is a small perturbation parameter (line <xref ref-type="fig" rid="fig-161">15</xref> in Algorithm <xref ref-type="fig" rid="fig-161">3</xref> for deterministic Newton and line <xref ref-type="fig" rid="fig-165">17</xref> in Algorithm <xref ref-type="fig" rid="fig-165">7</xref> for stochastic Newton).<xref ref-type="fn" rid="fn135"><sup>135</sup></xref><fn id="fn135"><label>135</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-149">149</xref>], p. 35, and [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 302, both of which cite the Levenberg-Marquardt algorithm as the first to use a regularized Hessian.</p></fn></p></sec>
<sec id="s6_2_4"><label>6.2.4</label>
<title>Inexact line-search, Wolfe&#x2019;s rule</title>
<p>The rule introduced in [<xref ref-type="bibr" rid="ref-152">152</xref>] and [<xref ref-type="bibr" rid="ref-153">153</xref>],<xref ref-type="fn" rid="fn136"><sup>136</sup></xref><fn id="fn136"><label>136</label><p>As of 2022.07.09, [<xref ref-type="bibr" rid="ref-152">152</xref>] was cited 1336 times in various publications (books, papers) according to Google Scholar, and 559 times in archival journal papers according to Web of Science.</p></fn> sometimes called the Armijo-Goldstein-Wolfe rule (or conditions), particularly in [<xref ref-type="bibr" rid="ref-140">140</xref>] and [<xref ref-type="bibr" rid="ref-141">141</xref>],<xref ref-type="fn" rid="fn137"><sup>137</sup></xref><fn id="fn137"><label>137</label><p>The authors of [<xref ref-type="bibr" rid="ref-140">140</xref>] and [<xref ref-type="bibr" rid="ref-141">141</xref>] may not have been aware that Goldstein&#x2019;s rule appeared before Armijo&#x2019;s rule, as they cited Goldstein&#x2019;s 1967 book [<xref ref-type="bibr" rid="ref-154">154</xref>], instead of Goldstein&#x2019;s 1965 paper [<xref ref-type="bibr" rid="ref-150">150</xref>], and referred often to Polak (1971) [<xref ref-type="bibr" rid="ref-139">139</xref>], even though it was written in [<xref ref-type="bibr" rid="ref-139">139</xref>], p. 32, that a &#x201C;step size rule [Eq. (<xref ref-type="disp-formula" rid="eqn-124">124</xref>)] probably first introduced by Goldstein (1967) [<xref ref-type="bibr" rid="ref-154">154</xref>]&#x201D; was used in an algorithm. 
See also Footnote <xref ref-type="fn" rid="fn123">123</xref>.</p></fn> and since extended to add stochasticity [<xref ref-type="bibr" rid="ref-143">143</xref>],<xref ref-type="fn" rid="fn138"><sup>138</sup></xref><fn id="fn138"><label>138</label><p>An earlier version of the 2017 paper [<xref ref-type="bibr" rid="ref-143">143</xref>] is the 2015 preprint [<xref ref-type="bibr" rid="ref-147">147</xref>].</p></fn> is stated as follows: For <inline-formula id="ieqn-875"><mml:math id="mml-ieqn-875"><mml:mn>0</mml:mn><mml:mo>&#x003C;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, select the step length (learning rate) <inline-formula id="ieqn-876"><mml:math id="mml-ieqn-876"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> such that (see, e.g., [<xref ref-type="bibr" rid="ref-149">149</xref>], p. 55):</p>
<p><disp-formula id="eqn-133"><label>(133)</label><mml:math id="mml-eqn-133" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-134"><label>(134)</label><mml:math id="mml-eqn-134" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The first Wolfe&#x2019;s rule in Eq. (<xref ref-type="disp-formula" rid="eqn-133">133</xref>) is the same as Armijo&#x2019;s rule in Eq. (<xref ref-type="disp-formula" rid="eqn-126">126</xref>), which ensures that at the updated point <inline-formula id="ieqn-877"><mml:math id="mml-ieqn-877"><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the cost function value <inline-formula id="ieqn-878"><mml:math id="mml-ieqn-878"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is below the green line <inline-formula id="ieqn-879"><mml:math id="mml-ieqn-879"><mml:mi>&#x03B1;</mml:mi><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-62">62</xref>.</p>
<p>The second Wolfe&#x2019;s rule in Eq. (<xref ref-type="disp-formula" rid="eqn-134">134</xref>) ensures that at the updated point <inline-formula id="ieqn-880"><mml:math id="mml-ieqn-880"><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the slope of the cost function does not fall below the (negative) slope of the purple line <inline-formula id="ieqn-881"><mml:math id="mml-ieqn-881"><mml:mi>&#x03B2;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">d</mml:mi></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-62">62</xref>.</p>
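<p>As a minimal sketch of how the two Wolfe&#x2019;s rules in Eqs. (<xref ref-type="disp-formula" rid="eqn-133">133</xref>)-(<xref ref-type="disp-formula" rid="eqn-134">134</xref>) could be checked numerically, the Python snippet below tests a trial step length &#x03F5; on a hypothetical quadratic cost. The names <monospace>wolfe_conditions</monospace>, <monospace>quadratic_cost</monospace> and <monospace>quadratic_grad</monospace>, as well as the values &#x03B1; = 10<sup>&#x2212;4</sup> and &#x03B2; = 0.9, are illustrative assumptions and are not taken from the references cited here.</p>
<code language="python">
```python
import numpy as np


def wolfe_conditions(J, grad_J, theta, d, eps, alpha=1e-4, beta=0.9):
    """Check both Wolfe rules for step length eps along direction d.

    Returns (first_ok, second_ok). alpha and beta are assumed values.
    """
    g = grad_J(theta)                    # gradient g at the current point theta
    slope0 = float(np.dot(g, d))         # g . d, negative for a descent direction
    theta_new = theta + eps * d          # updated point theta + eps d
    # First rule, Eq. (133): J(theta + eps d) is at most J(theta) + alpha eps g . d
    first_ok = bool(J(theta) + alpha * eps * slope0 >= J(theta_new))
    # Second rule, Eq. (134): slope at the updated point is at least beta g . d
    second_ok = bool(np.dot(grad_J(theta_new), d) >= beta * slope0)
    return first_ok, second_ok


def quadratic_cost(theta):
    """Hypothetical test cost J(theta) = 0.5 * theta . theta."""
    return 0.5 * float(np.dot(theta, theta))


def quadratic_grad(theta):
    """Gradient of the hypothetical test cost."""
    return theta


theta0 = np.array([1.0, 1.0])
d0 = -quadratic_grad(theta0)             # steepest-descent direction
results = {eps: wolfe_conditions(quadratic_cost, quadratic_grad, theta0, d0, eps)
           for eps in (0.01, 1.0, 2.0)}
```
</code>
<p>On this test cost, the step &#x03F5; = 1 satisfies both rules, the very short step &#x03F5; = 0.01 satisfies the first rule but violates the second, and the long step &#x03F5; = 2 violates the first rule since it yields no sufficient decrease.</p>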
<p>For other variants of line search, we refer to [<xref ref-type="bibr" rid="ref-161">161</xref>].</p></sec></sec>
<sec id="s6_3"><label>6.3</label>
<title>Stochastic gradient-descent (1st-order) methods</title>
<p>To avoid confusion,<xref ref-type="fn" rid="fn139"><sup>139</sup></xref><fn id="fn139"><label>139</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 271, about this terminology confusion. The authors of [<xref ref-type="bibr" rid="ref-80">80</xref>] used &#x201C;stochastic&#x201D; optimization to mean optimization using random &#x201C;minibatches&#x201D; of examples, and &#x201C;batch&#x201D; optimization to mean optimization using the &#x201C;full batch&#x201D; or full training set of examples.</p></fn> we will use the terminology &#x201C;full batch&#x201D; (instead of just &#x201C;batch&#x201D;) when the entire training set is used for training. A minibatch is a small subset of the training set.</p>
<p>In fact, as we shall see, and as mentioned in <xref ref-type="statement" rid="st6_4">Remark 6.4</xref>, the classical optimization methods described in Section <xref ref-type="sec" rid="s6_2">6.2</xref> have been developed further to tackle new problems, such as noisy gradients, encountered in deep-learning training with random minibatches. There is indeed much room for new research on the learning rate, since:</p>
<disp-quote><p>&#x201C;The learning rate may be chosen by trial and error. This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 287.</p>
</disp-quote><p>At the time of this writing, we are aware of two review papers on optimization algorithms for machine learning, and in particular deep learning, aimed primarily at experts in the field: [<xref ref-type="bibr" rid="ref-80">80</xref>], as mentioned above, and [<xref ref-type="bibr" rid="ref-162">162</xref>]. Our review complements these two review papers. Our aim here is to bring first-time learners up to speed so that they can benefit from, and hopefully even enjoy, reading these and other related papers. To this end, we deliberately avoid the dense mathematical-programming language, not familiar to readers outside the field, as used in [<xref ref-type="bibr" rid="ref-80">80</xref>], while providing more details on algorithms that have proved important in deep learning than [<xref ref-type="bibr" rid="ref-162">162</xref>].</p>
<p>Listed below are the points that distinguish the present paper from other reviews. Similar to [<xref ref-type="bibr" rid="ref-78">78</xref>], both [<xref ref-type="bibr" rid="ref-80">80</xref>] and [<xref ref-type="bibr" rid="ref-162">162</xref>]:
<list list-type="bullet">
<list-item><p>Only mentioned briefly in words the connection between SGD with momentum and mechanics, without a detailed explanation using the equation of motion of the &#x201C;heavy ball&#x201D;, a name not as accurate as the original name &#x201C;small heavy sphere&#x201D; by Polyak (1964) [<xref ref-type="bibr" rid="ref-3">3</xref>]. These references also did not explain how such motion helps to accelerate convergence; see Section <xref ref-type="sec" rid="s6_3_2">6.3.2</xref>.</p></list-item>
<list-item><p>Did not discuss recent practical add-on improvements to SGD such as step-length tuning (Section <xref ref-type="sec" rid="s6_3_3">6.3.3</xref>) and step-length decay (Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref>), as proposed in [<xref ref-type="bibr" rid="ref-55">55</xref>]. This information would be useful for first-time learners.</p></list-item>
<list-item><p>Did not connect step-length decay to simulated annealing, and did not explain the reason for using the name &#x201C;annealing&#x201D;<xref ref-type="fn" rid="fn140"><sup>140</sup></xref><fn id="fn140"><label>140</label><p>The authors of [<xref ref-type="bibr" rid="ref-162">162</xref>] only cited [<xref ref-type="bibr" rid="ref-163">163</xref>] for a brief mention of &#x201C;simulated annealing&#x201D; as an example of &#x201C;heuristic optimizers&#x201D;, with no discussion, and no connection to step-length decay. See also <xref ref-type="statement" rid="st6_10">Remark 6.10</xref> on &#x201C;Metaheuristics&#x201D;.</p></fn> in deep learning by connecting to stochastic differential equations and physics; see <xref ref-type="statement" rid="st6_9">Remark 6.9</xref> in Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>.</p></list-item>
<list-item><p>Did not review an alternative to step-length decay by increasing minibatch size, which could be more efficient, as proposed in [<xref ref-type="bibr" rid="ref-164">164</xref>]; see Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>.</p></list-item>
<list-item><p>Did not point out that the exponential smoothing method (or running average) used in adaptive learning-rate algorithms dates back to the 1950s in the field of forecasting. None of these references acknowledged the contributions made in [<xref ref-type="bibr" rid="ref-165">165</xref>] and [<xref ref-type="bibr" rid="ref-166">166</xref>], in which exponential smoothing from time series in forecasting was probably first brought to machine learning. See Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref>.</p></list-item>
<list-item><p>Did not discuss recent adaptive learning-rate algorithms such as <xref ref-type="sec" rid="s6_5_10">AdamW</xref> [<xref ref-type="bibr" rid="ref-56">56</xref>].<xref ref-type="fn" rid="fn141"><sup>141</sup></xref><fn id="fn141"><label>141</label><p>The authors of [<xref ref-type="bibr" rid="ref-162">162</xref>] only cited [<xref ref-type="bibr" rid="ref-56">56</xref>] in passing, without reviewing <xref ref-type="sec" rid="s6_5_10">AdamW</xref>, which was not even mentioned.</p></fn> These authors also did not discuss the criticism of adaptive methods in [<xref ref-type="bibr" rid="ref-55">55</xref>]; see Section <xref ref-type="sec" rid="s6_5_10">6.5.10</xref>.</p></list-item>
<list-item><p>Did not discuss classical line-search rules&#x2014;such as [<xref ref-type="bibr" rid="ref-150">150</xref>], [<xref ref-type="bibr" rid="ref-151">151</xref>],<xref ref-type="fn" rid="fn142"><sup>142</sup></xref><fn id="fn142"><label>142</label><p>The authors of [<xref ref-type="bibr" rid="ref-80">80</xref>] only cited Armijo (1966) [<xref ref-type="bibr" rid="ref-151">151</xref>] once for a pseudocode using line search.</p></fn> [<xref ref-type="bibr" rid="ref-152">152</xref>] (Sections <xref ref-type="sec" rid="s6_2_2">6.2.2</xref>, <xref ref-type="sec" rid="s6_2_3">6.2.3</xref>, <xref ref-type="sec" rid="s6_2_4">6.2.4</xref>)&#x2014;that have been recently generalized to add stochasticity, e.g., [<xref ref-type="bibr" rid="ref-143">143</xref>], [<xref ref-type="bibr" rid="ref-144">144</xref>], [<xref ref-type="bibr" rid="ref-145">145</xref>]; see Sections <xref ref-type="sec" rid="s6_6">6.6</xref>, <xref ref-type="sec" rid="s6_7">6.7</xref>.</p></list-item></list></p>
<sec id="s6_3_1"><label>6.3.1</label>
<title>Standard SGD, minibatch, fixed learning-rate schedule</title>
<p>The stochastic gradient descent algorithm, originally introduced by Robbins &amp; Monro (1951a) [<xref ref-type="bibr" rid="ref-167">167</xref>] (another classic) according to many sources,<xref ref-type="fn" rid="fn143"><sup>143</sup></xref><fn id="fn143"><label>143</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-80">80</xref>]&#x2013;in which there was a short bio of Robbins, the first author of [<xref ref-type="bibr" rid="ref-167">167</xref>]&#x2013;and [<xref ref-type="bibr" rid="ref-162">162</xref>] [<xref ref-type="bibr" rid="ref-144">144</xref>].</p></fn> has been playing an important role in training deep-learning networks:</p>
<disp-quote><p>&#x201C;Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent (SGD). Stochastic gradient descent is an extension of the gradient descent algorithm.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 147.</p>
</disp-quote><p><bold>Minibatch.</bold> The number <inline-formula id="ieqn-882"><mml:math id="mml-ieqn-882"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula> of examples in a training set <inline-formula id="ieqn-883"><mml:math id="mml-ieqn-883"><mml:mrow><mml:mtext>&#x1D569;</mml:mtext></mml:mrow></mml:math></inline-formula> could be very large, making it prohibitively expensive to evaluate the cost function and to compute its gradient with respect to the parameters, the number of which could itself be very large. At iteration <inline-formula id="ieqn-884"><mml:math id="mml-ieqn-884"><mml:mi>k</mml:mi></mml:math></inline-formula> within a training session <inline-formula id="ieqn-885"><mml:math id="mml-ieqn-885"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>, let <inline-formula id="ieqn-886"><mml:math id="mml-ieqn-886"><mml:msubsup><mml:mrow><mml:mtext>&#x1D540;</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> be a randomly selected set of <inline-formula id="ieqn-887"><mml:math id="mml-ieqn-887"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> indices, which are elements of the training-set indices <inline-formula id="ieqn-888"><mml:math id="mml-ieqn-888"><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>. 
Typically, <inline-formula id="ieqn-889"><mml:math id="mml-ieqn-889"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> is much smaller than <inline-formula id="ieqn-890"><mml:math id="mml-ieqn-890"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula>:<xref ref-type="fn" rid="fn144"><sup>144</sup></xref><fn id="fn144"><label>144</label><p>As of 2010.04.30, the ImageNet database contained more than 14 million images; see <ext-link ext-link-type="uri" xlink:href="http://www.image-net.org/about-stats">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20191002051313/http://www.image-net.org/about-stats">Internet archive</ext-link>, Figure <xref ref-type="fig" rid="fig-3">3</xref> and Footnote <xref ref-type="fn" rid="fn14">14</xref>. There is a slight inconsistency in notation in [<xref ref-type="bibr" rid="ref-78">78</xref>], where on p. 148, &#x1D4C2; and &#x1D4C2;<sup>&#x0027;</sup> denote the number of examples in the training set and in the minibatch, respectively, whereas on p. 274, &#x1D4C2; denotes the number of examples in a minibatch. In our notation, &#x1D4C2; is the dimension of the output array &#x1D4CE;, whereas &#x1D5C6; (in a different font) is the minibatch size; see Footnote <xref ref-type="fn" rid="fn88">88</xref>. In theory, we write <inline-formula id="ieqn-3166"><mml:math id="mml-ieqn-3166"><mml:mtext>&#x1D5C6;</mml:mtext><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-136">136</xref>); in practice, <inline-formula id="ieqn-3167"><mml:math id="mml-ieqn-3167"><mml:mtext>&#x1D5C6;</mml:mtext><mml:mo>&#x226A;</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula>.</p></fn></p>
<disp-quote><p>&#x201C;The minibatch size <inline-formula id="ieqn-891"><mml:math id="mml-ieqn-891"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> is typically chosen to be a relatively small number of examples, ranging from one to a few hundred. Crucially, <inline-formula id="ieqn-892"><mml:math id="mml-ieqn-892"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> is usually held fixed as the training set size <inline-formula id="ieqn-893"><mml:math id="mml-ieqn-893"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula> grows. We may fit a training set with billions of examples using updates computed on only a hundred examples.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 148.</p>
</disp-quote><p>Generated as in Eq. (<xref ref-type="disp-formula" rid="eqn-136">136</xref>), the random-index sets <inline-formula id="ieqn-894"><mml:math id="mml-ieqn-894"><mml:msubsup><mml:mrow><mml:mtext>&#x1D540;</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>, for <inline-formula id="ieqn-895"><mml:math id="mml-ieqn-895"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo></mml:math></inline-formula>, are non-overlapping such that after <inline-formula id="ieqn-896"><mml:math id="mml-ieqn-896"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> iterations, all examples in the training set are covered, and a training session, or training epoch,<xref ref-type="fn" rid="fn145"><sup>145</sup></xref><fn id="fn145"><label>145</label><p>An epoch, or training session, &#x03C4; is explicitly defined here as complete when the minibatches generated as in <xref ref-type="disp-formula" rid="eqn-135">Eqs. (135)</xref>-(<xref ref-type="disp-formula" rid="eqn-137">137</xref>) have covered the whole dataset. In [<xref ref-type="bibr" rid="ref-78">78</xref>], the first time the word &#x201C;epoch&#x201D; appeared was in the caption of Figure 7.3, p. 239, where it was defined as a &#x201C;training iteration&#x201D;, but there was no explicit definition of &#x201C;epoch&#x201D; (when it started and when it ended), except indirectly as a &#x201C;training pass through the dataset&#x201D;, p. 274. 
See Figure <xref ref-type="fig" rid="fig-151">151</xref> in Section <xref ref-type="sec" rid="s14_7">14.7</xref> on &#x201C;Lack of transparency and irreproducibility of results&#x201D; in recent deep-learning papers.</p></fn> is completed (line <xref ref-type="fig" rid="fig-162">6</xref> in Algorithm <xref ref-type="fig" rid="fig-162">4</xref>). At iteration <inline-formula id="ieqn-897"><mml:math id="mml-ieqn-897"><mml:mi>k</mml:mi></mml:math></inline-formula> of a training epoch <inline-formula id="ieqn-898"><mml:math id="mml-ieqn-898"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>, the random minibatch <inline-formula id="ieqn-899"><mml:math id="mml-ieqn-899"><mml:msubsup><mml:mrow><mml:mtext>&#x1D539;</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> is a set of <inline-formula id="ieqn-900"><mml:math id="mml-ieqn-900"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> examples pulled out from the much larger training set <inline-formula id="ieqn-901"><mml:math id="mml-ieqn-901"><mml:mrow><mml:mtext>&#x1D569;</mml:mtext></mml:mrow></mml:math></inline-formula> using the random indices in <inline-formula id="ieqn-902"><mml:math id="mml-ieqn-902"><mml:msubsup><mml:mrow><mml:mtext>&#x1D540;</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>, with the corresponding targets in the set <inline-formula id="ieqn-903"><mml:math 
id="mml-ieqn-903"><mml:msubsup><mml:mrow><mml:mtext>&#x1D54B;</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-135"><label>(135)</label><mml:math id="mml-eqn-135" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>=:</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-136"><label>(136)</label><mml:math id="mml-eqn-136" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi mathvariant="double-struck">I</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>&#x2286;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mrow><mml:mi mathvariant="double-struck">I</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mtext>&#x00A0;for&#x00A0;</mml:mtext></mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-137"><label>(137)</label><mml:math id="mml-eqn-137" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi mathvariant="double-struck">B</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>&#x2286;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo 
stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-138"><label>(138)</label><mml:math id="mml-eqn-138" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi mathvariant="double-struck">T</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>&#x2286;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">Y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo 
stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Note that once the random index set <inline-formula id="ieqn-904"><mml:math id="mml-ieqn-904"><mml:msubsup><mml:mrow><mml:mtext>&#x1D540;</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> has been selected, it is deleted from its superset <inline-formula id="ieqn-905"><mml:math id="mml-ieqn-905"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to form <inline-formula id="ieqn-906"><mml:math id="mml-ieqn-906"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, so that the next random set <inline-formula id="ieqn-907"><mml:math id="mml-ieqn-907"><mml:msubsup><mml:mrow><mml:mtext>&#x1D540;</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> does not contain indices already selected in <inline-formula id="ieqn-908"><mml:math id="mml-ieqn-908"><mml:msubsup><mml:mrow><mml:mtext>&#x1D540;</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>.</p>
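<p>The generation of non-overlapping random minibatches in Eqs. (<xref ref-type="disp-formula" rid="eqn-135">135</xref>)-(<xref ref-type="disp-formula" rid="eqn-138">138</xref>) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name <monospace>minibatch_index_sets</monospace>, the toy data, and the random seed are hypothetical, and the minibatch size is assumed to divide the training-set size exactly.</p>
<code language="python">
```python
import numpy as np


def minibatch_index_sets(M, m, rng):
    """Yield (k, I_k) for one epoch: non-overlapping random index sets of size m."""
    remaining = np.arange(1, M + 1)                 # M_1 = {1, ..., M}, Eq. (135)
    k_max = M // m                                  # assumes m divides M exactly
    for k in range(1, k_max + 1):
        index_set = rng.choice(remaining, size=m, replace=False)   # I_k drawn from M_k
        remaining = np.setdiff1d(remaining, index_set)              # M_{k+1} = M_k - I_k
        yield k, index_set


rng = np.random.default_rng(0)                      # arbitrary seed
X = np.arange(12.0).reshape(12, 1)                  # hypothetical inputs, 12 examples
Y = 2.0 * X                                         # hypothetical targets

epoch = []
for k, index_set in minibatch_index_sets(M=12, m=4, rng=rng):
    B_k = X[index_set - 1]                          # minibatch inputs, Eq. (137)
    T_k = Y[index_set - 1]                          # matching targets, Eq. (138)
    epoch.append((k, index_set, B_k, T_k))
```
</code>
<p>After the three iterations of this toy epoch, every index from 1 to 12 has been used exactly once. An equivalent partition could presumably be obtained by shuffling all indices once per epoch and slicing the result into consecutive blocks.</p>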
<p>Unlike the iteration counter <inline-formula id="ieqn-909"><mml:math id="mml-ieqn-909"><mml:mi>k</mml:mi></mml:math></inline-formula> within a training epoch <inline-formula id="ieqn-910"><mml:math id="mml-ieqn-910"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>, the global iteration counter <inline-formula id="ieqn-911"><mml:math id="mml-ieqn-911"><mml:mi>j</mml:mi></mml:math></inline-formula> is not reset to 1 at the beginning of a new training epoch <inline-formula id="ieqn-912"><mml:math id="mml-ieqn-912"><mml:mi>&#x03C4;</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, but continues to increment for each new minibatch. Plots versus epoch counter <inline-formula id="ieqn-913"><mml:math id="mml-ieqn-913"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> and plots versus global iteration counter <inline-formula id="ieqn-914"><mml:math id="mml-ieqn-914"><mml:mi>j</mml:mi></mml:math></inline-formula> could be confusing; see <xref ref-type="statement" rid="st6_17">Remark 6.17</xref> and Figure <xref ref-type="fig" rid="fig-78">78</xref>.</p>
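<p>To make the relation among the three counters concrete, the short sketch below computes the global iteration counter from the epoch counter and the within-epoch counter. It assumes that all counters start at 1 and that every epoch contains the same number of iterations; the helper name <monospace>global_iteration</monospace> is hypothetical.</p>
<code language="python">
```python
def global_iteration(tau, k, k_max):
    """Global counter j for iteration k of epoch tau (all counters start at 1)."""
    return (tau - 1) * k_max + k


k_max = 3                                           # iterations per epoch (assumed)
schedule = [(tau, k, global_iteration(tau, k, k_max))
            for tau in (1, 2) for k in range(1, k_max + 1)]
```
</code>
<p>With three iterations per epoch, the within-epoch counter restarts at 1 in the second epoch while the global counter continues from 4 to 6.</p>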
<p><bold>Cost and gradient estimates.</bold> The cost-function estimate is the average of the per-example cost functions, each evaluated on an example <inline-formula id="ieqn-915"><mml:math id="mml-ieqn-915"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>a</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> in the minibatch for iteration <inline-formula id="ieqn-916"><mml:math id="mml-ieqn-916"><mml:mi>k</mml:mi></mml:math></inline-formula> in training epoch <inline-formula id="ieqn-917"><mml:math id="mml-ieqn-917"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-139"><label>(139)</label><mml:math id="mml-eqn-139" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>a</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>a</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>a</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>a</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mrow><mml:mi mathvariant="double-struck">B</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>a</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mrow><mml:mi mathvariant="double-struck">T</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the random index is written as <inline-formula id="ieqn-918"><mml:math id="mml-ieqn-918"><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> instead of <inline-formula id="ieqn-919"><mml:math id="mml-ieqn-919"><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as in Eq. (<xref ref-type="disp-formula" rid="eqn-136">136</xref>) to lighten the notation. The corresponding gradient estimate is:</p>
<p><disp-formula id="eqn-140"><label>(140)</label><mml:math id="mml-eqn-140" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>a</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>a</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
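<p>As a minimal sketch of Eqs. (<xref ref-type="disp-formula" rid="eqn-139">139</xref>)-(<xref ref-type="disp-formula" rid="eqn-140">140</xref>), the Python code below averages per-example costs and gradients over a randomly drawn minibatch. The scalar model f(x, &#x03B8;) = &#x03B8; x, the squared-error cost, the synthetic data, and all function names are assumptions made only for illustration; they are not taken from the cited references.</p>
<preformat preformat-type="code">
```python
import random


def example_cost(theta, x, y):
    """Per-example cost J_i(theta) = 0.5 * (f - y)**2, with assumed model f = theta * x."""
    return 0.5 * (theta * x - y) ** 2


def example_grad(theta, x, y):
    """Per-example gradient g_i = dJ_i/dtheta for the assumed model."""
    return (theta * x - y) * x


def minibatch_estimates(theta, xs, ys, m, rng):
    """Return (J_tilde, g_tilde): averages over m randomly chosen examples."""
    indices = rng.sample(range(len(xs)), m)  # random indices i_1, ..., i_m (no repeats)
    cost = sum(example_cost(theta, xs[i], ys[i]) for i in indices) / m
    grad = sum(example_grad(theta, xs[i], ys[i]) for i in indices) / m
    return cost, grad


# Hypothetical training set of M = 8 examples generated by y = 3 x.
xs = [0.5 * a for a in range(1, 9)]
ys = [3.0 * x for x in xs]
rng = random.Random(0)

cost_full, grad_full = minibatch_estimates(2.0, xs, ys, len(xs), rng)  # m = M
cost_mini, grad_mini = minibatch_estimates(2.0, xs, ys, 4, rng)        # m = 4
```
</preformat>
<p>In this sketch, choosing the minibatch size equal to the full training-set size recovers the full-batch cost and gradient, whereas a smaller minibatch gives estimates that vary with the random draw.</p>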
<p>The pseudocode for the standard SGD<xref ref-type="fn" rid="fn146"><sup>146</sup></xref><fn id="fn146"><label>146</label><p>See also [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 286, Algorithm 8.1; [<xref ref-type="bibr" rid="ref-80">80</xref>], p. 243, Algorithm 4.1.</p></fn> is given in Algorithm <xref ref-type="fig" rid="fig-162">4</xref>. The epoch stopping criterion (line <xref ref-type="fig" rid="fig-162">1</xref> in Algorithm <xref ref-type="fig" rid="fig-162">4</xref>) is usually determined by a computational &#x201C;budget&#x201D;, i.e., the maximum number of epochs allowed. For example, [<xref ref-type="bibr" rid="ref-145">145</xref>] set a budget of at most 1,600 epochs in their numerical examples.</p>
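<p>The following Python code is a rough sketch of a vanilla SGD loop whose only stopping criterion is an epoch budget. It is not a transcription of Algorithm <xref ref-type="fig" rid="fig-162">4</xref>; the scalar model, the squared-error cost, the data, the function name <monospace>sgd</monospace>, and the parameter values are all hypothetical choices made for illustration.</p>
<preformat preformat-type="code">
```python
import random


def sgd(theta0, xs, ys, step, m, max_epochs, seed=0):
    """Vanilla SGD on J_i = 0.5 * (theta * x_i - y_i)**2 (assumed model)."""
    rng = random.Random(seed)
    theta = theta0
    order = list(range(len(xs)))
    for epoch in range(max_epochs):            # epoch budget is the only stopping criterion
        rng.shuffle(order)                     # new random minibatches in every epoch
        for start in range(0, len(order), m):  # one iteration per minibatch
            batch = order[start:start + m]
            grad = sum((theta * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            theta = theta - step * grad        # standard update, no momentum
    return theta


# Hypothetical data generated by y = 3 x; the minimizer is theta = 3.
xs = [0.5 * a for a in range(1, 9)]
ys = [3.0 * x for x in xs]
theta_final = sgd(theta0=0.0, xs=xs, ys=ys, step=0.05, m=4, max_epochs=50)
```
</preformat>
<p>A fixed step length is used here for simplicity; the add-on tricks discussed in the remainder of this section would modify the update line or the step length.</p>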
<p><bold>Problems and resurgence of SGD.</bold> There are several known problems with SGD:</p>
<disp-quote><p>&#x201C;Despite the prevalent use of SGD, it has known challenges and inefficiencies. First, the direction may not represent a descent direction, and second, the method is sensitive to the step-size (learning rate) which is often poorly overestimated.&#x201D; [<xref ref-type="bibr" rid="ref-144">144</xref>]</p>
</disp-quote><p>For the above reasons, a small norm of the gradient estimate may not be an appropriate stationarity condition, i.e., an indicator of where a local minimizer or saddle point is located; see the discussion in [<xref ref-type="bibr" rid="ref-145">145</xref>] and the stochastic Newton Algorithm <xref ref-type="fig" rid="fig-165">7</xref> in Section <xref ref-type="sec" rid="s6_7">6.7</xref>.</p>
<p>Despite the above problems, SGD has been brought back to the forefront as the state-of-the-art algorithm to beat, surpassing the performance of adaptive methods, as confirmed by three recent papers: [<xref ref-type="bibr" rid="ref-55">55</xref>], [<xref ref-type="bibr" rid="ref-168">168</xref>], [<xref ref-type="bibr" rid="ref-56">56</xref>]; see Section <xref ref-type="sec" rid="s6_5_9">6.5.9</xref> on criticism of adaptive methods.</p>
<p><bold>Add-on tricks to improve SGD.</bold> The following tricks can be added onto the vanilla (standard) SGD to improve its performance; see also the pseudocode in Algorithm <xref ref-type="fig" rid="fig-162">4</xref>:
<list list-type="bullet">
<list-item><p>Momentum and accelerated gradient: Improve (accelerate) convergence in narrow valleys, Section <xref ref-type="sec" rid="s6_3_2">6.3.2</xref></p></list-item>
<list-item><p>Initial-step-length tuning: Find effective initial step length <inline-formula id="ieqn-920"><mml:math id="mml-ieqn-920"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>, Section <xref ref-type="sec" rid="s6_3_3">6.3.3</xref></p></list-item>
<list-item><p>Step-length decaying or annealing: Find an effective learning-rate schedule<xref ref-type="fn" rid="fn147"><sup>147</sup></xref><fn id="fn147"><label>147</label><p>See Figure <xref ref-type="fig" rid="fig-151">151</xref> in Section <xref ref-type="sec" rid="s14_7">14.7</xref> on &#x201C;Lack of transparency and irreproducibility of results&#x201D; in recent deep-learning papers.</p></fn> to decrease the step length <inline-formula id="ieqn-921"><mml:math id="mml-ieqn-921"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> as a function of epoch counter <inline-formula id="ieqn-922"><mml:math id="mml-ieqn-922"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> or global iteration counter <inline-formula id="ieqn-923"><mml:math id="mml-ieqn-923"><mml:mi>j</mml:mi></mml:math></inline-formula>, cyclic annealing, Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref></p></list-item>
<list-item><p>Minibatch-size increase, keeping step length fixed, equivalent annealing, Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref></p></list-item>
<list-item><p>Weight decay, Section <xref ref-type="sec" rid="s6_3_6">6.3.6</xref></p></list-item></list></p>
</sec>
<sec id="s6_3_2"><label>6.3.2</label>
<title>Momentum and fast (accelerated) gradient</title>
<p>The standard update for gradient descent, Eq. (<xref ref-type="disp-formula" rid="eqn-120">120</xref>), would be slow when encountering a deep and narrow valley, as shown in Figure <xref ref-type="fig" rid="fig-63">63</xref>, and can be replaced by the general update with momentum as follows:</p>
<p><disp-formula id="eqn-141"><label>(141)</label><mml:math id="mml-eqn-141" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>from which the following methods are obtained (line <xref ref-type="fig" rid="fig-162">10</xref> in Algorithm <xref ref-type="fig" rid="fig-162">4</xref>):
<list list-type="bullet">
<list-item><p>Standard SGD update Eq. (<xref ref-type="disp-formula" rid="eqn-120">120</xref>) with <inline-formula id="ieqn-926"><mml:math id="mml-ieqn-926"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-49">49</xref>]</p></list-item>
<list-item><p>SGD with classical momentum: <inline-formula id="ieqn-927"><mml:math id="mml-ieqn-927"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-928"><mml:math id="mml-ieqn-928"><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (&#x201C;small heavy sphere&#x201D; or heavy point mass)<xref ref-type="fn" rid="fn148"><sup>148</sup></xref><fn id="fn148"><label>148</label><p>Often called by the more colloquial &#x201C;heavy ball&#x201D; method; see <xref ref-type="statement" rid="st6_6">Remark 6.6</xref>.</p></fn> [<xref ref-type="bibr" rid="ref-3">3</xref>]</p></list-item>
<list-item><p>SGD with fast (accelerated) gradient:<xref ref-type="fn" rid="fn149"><sup>149</sup></xref><fn id="fn149"><label>149</label><p>Sometimes referred to as Nesterov&#x2019;s Accelerated Gradient (NAG) in the deep-learning literature.</p></fn> <inline-formula id="ieqn-929"><mml:math id="mml-ieqn-929"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, Nesterov (1983 [<xref ref-type="bibr" rid="ref-50">50</xref>], 2018 [<xref ref-type="bibr" rid="ref-51">51</xref>])</p></list-item></list></p>
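<p>The three special cases of Eq. (<xref ref-type="disp-formula" rid="eqn-141">141</xref>) listed above can be sketched in Python as a single update function with parameters &#x03B3; and &#x03B6;. For reproducibility, the sketch uses the exact gradient of a hypothetical two-parameter quadratic cost with a narrow valley instead of a minibatch gradient estimate; the curvatures, step length, number of steps, and function names are assumptions, and the comparison holds only for this particular setup.</p>
<preformat preformat-type="code">
```python
def grad(theta, curvatures=(1.0, 50.0)):
    """Gradient of the assumed valley cost J = 0.5 * sum(c_j * theta_j**2)."""
    return [c * t for c, t in zip(curvatures, theta)]


def momentum_update(theta, theta_prev, step, gamma, zeta):
    """One step of the general update: gradient at the look-ahead point plus momentum."""
    velocity = [t - p for t, p in zip(theta, theta_prev)]
    look_ahead = [t + gamma * v for t, v in zip(theta, velocity)]
    g = grad(look_ahead)
    return [t - step * gi + zeta * v for t, gi, v in zip(theta, g, velocity)]


def run(gamma, zeta, step=0.02, n_steps=100):
    """Iterate from theta = (1, 1) with zero initial velocity."""
    theta_prev = [1.0, 1.0]
    theta = list(theta_prev)
    for _ in range(n_steps):
        theta, theta_prev = momentum_update(theta, theta_prev, step, gamma, zeta), theta
    return theta


def distance_to_minimum(theta):
    """Euclidean distance to the minimizer at the origin."""
    return sum(t * t for t in theta) ** 0.5


d_sgd = distance_to_minimum(run(gamma=0.0, zeta=0.0))  # standard update
d_mom = distance_to_minimum(run(gamma=0.0, zeta=0.9))  # classical momentum
d_nag = distance_to_minimum(run(gamma=0.9, zeta=0.9))  # fast (accelerated) gradient
```
</preformat>
<p>For these assumed values, both momentum variants end closer to the minimizer than the standard update after the same number of steps; other curvatures or step lengths may change the ranking.</p>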
<fig id="fig-162">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-162.tif"/>
</fig>
<fig id="fig-63">
<label>Figure 63</label>
<caption><title><italic>SGD with momentum, small heavy sphere</italic> Section <xref ref-type="sec" rid="s6_3_2">6.3.2</xref>. The descent direction (negative gradient, black arrows) bounces back and forth between the steep slopes of a deep and narrow valley. The small-heavy-sphere method, or SGD with momentum, follows a faster descent (red path) toward the bottom of the valley. See the cost-function landscape with deep valleys in Figure <xref ref-type="fig" rid="fig-55">55</xref>. Figure from [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 289. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-63.tif"/>
</fig>
<p>The continuous counterpart of the parameter update Eq. (<xref ref-type="disp-formula" rid="eqn-141">141</xref>) with classical momentum, i.e., when <inline-formula id="ieqn-930"><mml:math id="mml-ieqn-930"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-931"><mml:math id="mml-ieqn-931"><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, is the equation of motion of a heavy point mass (thus no rotatory inertia) under viscous friction at slow motion (proportional to velocity) and applied force <inline-formula id="ieqn-932"><mml:math id="mml-ieqn-932"><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>, given below with its discretization by finite difference in time, where <inline-formula id="ieqn-933"><mml:math id="mml-ieqn-933"><mml:msub><mml:mi>h</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-934"><mml:math id="mml-ieqn-934"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> are the time-step sizes [<xref ref-type="bibr" rid="ref-169">169</xref>]:</p>
<p><disp-formula id="eqn-142"><label>(142)</label><mml:math id="mml-eqn-142" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mfrac><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mi>&#x03BD;</mml:mi><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mfrac></mml:mstyle><mml:msub><mml:mi>h</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mfrac><mml:mo>+</mml:mo><mml:mi>&#x03BD;</mml:mi><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-143"><label>(143)</label><mml:math id="mml-eqn-143" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" 
/><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mi>h</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mfrac><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>&#x03BD;</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>&#x03BD;</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is the same as the update Eq. (<xref ref-type="disp-formula" rid="eqn-141">141</xref>) with <inline-formula id="ieqn-935"><mml:math id="mml-ieqn-935"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>. The term <inline-formula id="ieqn-936"><mml:math id="mml-ieqn-936"><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is often called the &#x201C;momentum&#x201D; term since it is proportional to (discretized) velocity. [<xref ref-type="bibr" rid="ref-3">3</xref>] on the other hand explained the term <inline-formula id="ieqn-937"><mml:math id="mml-ieqn-937"><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as &#x201C;giving inertia to the motion, [leading] to motion along the &#x201C;essential&#x201D; direction, i.e. 
along &#x2018;the bottom of the trough&#x2019; &#x201D;, and recommended selecting <inline-formula id="ieqn-938"><mml:math id="mml-ieqn-938"><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0.8</mml:mn><mml:mo>,</mml:mo><mml:mn>0.99</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e., close to 1, without explanation. The reason is to have low friction, i.e., <inline-formula id="ieqn-939"><mml:math id="mml-ieqn-939"><mml:mi>&#x03BD;</mml:mi></mml:math></inline-formula> small, but not zero friction (<inline-formula id="ieqn-940"><mml:math id="mml-ieqn-940"><mml:mi>&#x03BD;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>), since friction is needed to slow down the motion of the sphere up and down the valley sides (like skateboarding from side to side in a half-pipe), thus accelerating convergence toward the trough of the valley. From Eq. (<xref ref-type="disp-formula" rid="eqn-143">143</xref>), we have</p>
<p><disp-formula id="eqn-144"><label>(144)</label><mml:math id="mml-eqn-144" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>h</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>h</mml:mi><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03BD;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03BD;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
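<p>The link between the friction coefficient &#x03BD; and the momentum parameter &#x03B6; can be checked numerically. The Python sketch below assumes a uniform time step h, as in Eq. (<xref ref-type="disp-formula" rid="eqn-144">144</xref>), and a scalar parameter; it computes &#x03B6; and &#x03F5; from Eq. (<xref ref-type="disp-formula" rid="eqn-143">143</xref>) and verifies that the resulting update satisfies the finite-difference form in Eq. (<xref ref-type="disp-formula" rid="eqn-142">142</xref>). The function names and numerical values are hypothetical.</p>
<preformat preformat-type="code">
```python
def momentum_coefficients(h, nu):
    """Return (zeta, epsilon) for a uniform time step h and friction coefficient nu."""
    zeta = 1.0 / (1.0 + nu * h)
    epsilon = h * h / (1.0 + nu * h)
    return zeta, epsilon


def step_with_momentum(theta_k, theta_km1, g_k, h, nu):
    """Classical-momentum update solved for the next iterate (scalar case)."""
    zeta, epsilon = momentum_coefficients(h, nu)
    return theta_k + zeta * (theta_k - theta_km1) - epsilon * g_k


def residual_of_motion(theta_kp1, theta_k, theta_km1, g_k, h, nu):
    """Left side minus right side of the discretized equation of motion."""
    acceleration = ((theta_kp1 - theta_k) / h - (theta_k - theta_km1) / h) / h
    velocity = (theta_kp1 - theta_k) / h
    return acceleration + nu * velocity + g_k


# Hypothetical numbers, chosen only for illustration.
h, nu = 0.1, 2.0
theta_km1, theta_k, g_k = 1.0, 0.9, 0.5
theta_kp1 = step_with_momentum(theta_k, theta_km1, g_k, h, nu)
residual = residual_of_motion(theta_kp1, theta_k, theta_km1, g_k, h, nu)

zeta_no_friction, _ = momentum_coefficients(h, 0.0)     # nu = 0 gives zeta = 1
zeta_high_friction, _ = momentum_coefficients(h, 10.0)  # zeta = 1 / (1 + 1) = 0.5
```
</preformat>
<p>Under these assumptions the residual vanishes up to round-off, and increasing the friction coefficient moves &#x03B6; from 1 toward 0.</p>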
<statement id="st6_5"><title>Remark 6.5.</title>
<p>The choice of the momentum parameter <inline-formula id="ieqn-941"><mml:math id="mml-ieqn-941"><mml:mi>&#x03B6;</mml:mi></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-141">141</xref>) is not trivial. If <inline-formula id="ieqn-942"><mml:math id="mml-ieqn-942"><mml:mi>&#x03B6;</mml:mi></mml:math></inline-formula> is too small, the signal will be too noisy; if <inline-formula id="ieqn-943"><mml:math id="mml-ieqn-943"><mml:mi>&#x03B6;</mml:mi></mml:math></inline-formula> is too large, &#x201C;the average will lag too far behind the (drifting) signal&#x201D; [<xref ref-type="bibr" rid="ref-165">165</xref>], p. 212. Even though Polyak (1964) [<xref ref-type="bibr" rid="ref-3">3</xref>] recommended to select <inline-formula id="ieqn-944"><mml:math id="mml-ieqn-944"><mml:mi>&#x03B6;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0.8</mml:mn><mml:mo>,</mml:mo><mml:mn>0.99</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, as explained above, it was reported in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 290: &#x201C;Common values of <inline-formula id="ieqn-945"><mml:math id="mml-ieqn-945"><mml:mi>&#x03B6;</mml:mi></mml:math></inline-formula> used in practice include 0.5, 0.9, and 0.99. Like the learning rate, <inline-formula id="ieqn-946"><mml:math id="mml-ieqn-946"><mml:mi>&#x03B6;</mml:mi></mml:math></inline-formula> may also be adapted over time. Typically it begins with a small value and is later raised. Adapting <inline-formula id="ieqn-947"><mml:math id="mml-ieqn-947"><mml:mi>&#x03B6;</mml:mi></mml:math></inline-formula> over time is less important than shrinking <inline-formula id="ieqn-948"><mml:math id="mml-ieqn-948"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> over time&#x201D;. 
The value of <inline-formula id="ieqn-949"><mml:math id="mml-ieqn-949"><mml:mi>&#x03B6;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula> would correspond to relatively high friction <inline-formula id="ieqn-950"><mml:math id="mml-ieqn-950"><mml:mi>&#x03BD;</mml:mi></mml:math></inline-formula>, slowing down the motion of the sphere, compared to <inline-formula id="ieqn-951"><mml:math id="mml-ieqn-951"><mml:mi>&#x03B6;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.99</mml:mn></mml:math></inline-formula>.</p></statement>
<p>Figure <xref ref-type="fig" rid="fig-68">68</xref> from [<xref ref-type="bibr" rid="ref-170">170</xref>] shows the convergence of some adaptive learning-rate algorithms: <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_3_2">SGDNesterov</xref> (accelerated gradient), <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref>.</p>
<p>In their remarkable paper, the authors of [<xref ref-type="bibr" rid="ref-55">55</xref>] used a constant momentum parameter <inline-formula id="ieqn-952"><mml:math id="mml-ieqn-952"><mml:mi>&#x03B6;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.9</mml:mn></mml:math></inline-formula>; see <xref ref-type="sec" rid="s6_5_9">criticism of adaptive methods in Section</xref> <xref ref-type="sec" rid="s6_5">6.5</xref> and Figure <xref ref-type="fig" rid="fig-73">73</xref> comparing <xref ref-type="sec" rid="s6_3_2">SGD</xref>, <xref ref-type="sec" rid="s6_3_2">SGD with momentum</xref>, <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref>.<xref ref-type="fn" rid="fn150"><sup>150</sup></xref><fn id="fn150"><label>150</label><p>A nice animation of various optimizers (<xref ref-type="sec" rid="s6_3_2">SGD</xref>, <xref ref-type="sec" rid="s6_3_2">SGD with momentum</xref>, <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_6">AdaDelta</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>) can be found in S. Ruder, &#x2018;An overview of gradient descent optimization algorithms&#x2019;, updated on 2018.09.02 (<ext-link ext-link-type="uri" xlink:href="https://ruder.io/optimizing-gradient-descent/">Original website</ext-link>).</p></fn></p>
<p>See Figure <xref ref-type="fig" rid="fig-151">151</xref> in Section <xref ref-type="sec" rid="s14_7">14.7</xref> on &#x201C;Lack of transparency and irreproducibility of results&#x201D; in recent deep-learning papers.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p>
<p>For more insight into the update Eq. (<xref ref-type="disp-formula" rid="eqn-143">143</xref>), consider the case of constant coefficients <inline-formula id="ieqn-953"><mml:math id="mml-ieqn-953"><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03B6;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-954"><mml:math id="mml-ieqn-954"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula>, and rewrite this recursive relation in the form:</p>
<p><disp-formula id="eqn-145"><label>(145)</label><mml:math id="mml-eqn-145" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;using&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>0</mml:mn></mml:msub><mml:mspace width="thinmathspace" 
/><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>i.e., without momentum for the first term. So the effective gradient is the sum of all gradients, from the present (summation index <inline-formula id="ieqn-955"><mml:math id="mml-ieqn-955"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>) back to the beginning (<inline-formula id="ieqn-956"><mml:math id="mml-ieqn-956"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>), weighted by the exponential function <inline-formula id="ieqn-957"><mml:math id="mml-ieqn-957"><mml:msup><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:math></inline-formula>, so that there is a fading memory effect, i.e., gradients that are farther back in time have less influence than those closer to the present time.<xref ref-type="fn" rid="fn151"><sup>151</sup></xref><fn id="fn151"><label>151</label><p>See also Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref> on time series and exponential smoothing.</p></fn> The summation term in Eq. (<xref ref-type="disp-formula" rid="eqn-145">145</xref>) also provides an explanation of how the &#x201C;inertia&#x201D; (or momentum) term works: (1) two successive opposite gradients would cancel each other, whereas (2) two successive gradients in the same direction (toward the trough of the valley) would reinforce each other. See also [<xref ref-type="bibr" rid="ref-171">171</xref>], pp. 104-105, and [<xref ref-type="bibr" rid="ref-172">172</xref>], who provided a similar explanation:</p>
<disp-quote><p>&#x201C;Momentum is a simple method for increasing the speed of learning when the objective function contains long, narrow and fairly straight ravines with a gentle but consistent gradient along the floor of the ravine and much steeper gradients up the sides of the ravine. The momentum method simulates a heavy ball rolling down a surface. The ball builds up velocity along the floor of the ravine, but not across the ravine because the opposing gradients on opposite sides of the ravine cancel each other out over time.&#x201D;</p>
</disp-quote><p>In recent years, Polyak (1964) [<xref ref-type="bibr" rid="ref-3">3</xref>] (English version)<xref ref-type="fn" rid="fn152"><sup>152</sup></xref><fn id="fn152"><label>152</label><p>The English version of Polyak (1964) [<xref ref-type="bibr" rid="ref-3">3</xref>] appeared before 1979, as cited in [<xref ref-type="bibr" rid="ref-173">173</xref>], where a similar classical dynamics of a &#x201C;small heavy sphere&#x201D; or heavy point mass was used to develop an iterative method to solve nonlinear systems. There, the name Polyak was spelled as &#x201C;Poljak&#x201D; as in the Russian version. The earliest citations of the Russian version, with the spelling &#x201C;Poljak&#x201D;, were in [<xref ref-type="bibr" rid="ref-156">156</xref>] and in [<xref ref-type="bibr" rid="ref-174">174</xref>], but the terminology &#x201C;small heavy sphere&#x201D; was not used. See also [<xref ref-type="bibr" rid="ref-171">171</xref>], p. 104 and p. 481, where the Russian version of [<xref ref-type="bibr" rid="ref-3">3</xref>] was cited.</p></fn> has often been cited for the classical momentum (&#x201C;small heavy sphere&#x201D;) method to accelerate the convergence in gradient descent, but this was not the case earlier, e.g., the authors of [<xref ref-type="bibr" rid="ref-22">22</xref>] [<xref ref-type="bibr" rid="ref-175">175</xref>] [<xref ref-type="bibr" rid="ref-176">176</xref>] [<xref ref-type="bibr" rid="ref-177">177</xref>] [<xref ref-type="bibr" rid="ref-172">172</xref>] used the same method without citing [<xref ref-type="bibr" rid="ref-3">3</xref>]. Several books on optimization not related to neural networks, many of them well-known, also did not mention this method: [<xref ref-type="bibr" rid="ref-139">139</xref>] [<xref ref-type="bibr" rid="ref-178">178</xref>] [<xref ref-type="bibr" rid="ref-149">149</xref>] [<xref ref-type="bibr" rid="ref-157">157</xref>] [<xref ref-type="bibr" rid="ref-148">148</xref>] [<xref ref-type="bibr" rid="ref-179">179</xref>]. 
Both the original Russian version and the English translated version [<xref ref-type="bibr" rid="ref-3">3</xref>] (whose author&#x2019;s name was spelled as &#x201C;Poljak&#x201D; before 1990) were cited in the book on neural networks [<xref ref-type="bibr" rid="ref-180">180</xref>], in which another neural-network book [<xref ref-type="bibr" rid="ref-171">171</xref>] was referred to for a discussion of the formulation.<xref ref-type="fn" rid="fn153"><sup>153</sup></xref><fn id="fn153"><label>153</label><p>See [<xref ref-type="bibr" rid="ref-180">180</xref>], p. 159, p. 115, and [<xref ref-type="bibr" rid="ref-171">171</xref>], p. 104, respectively. The name &#x201C;Polyak&#x201D; was spelled as &#x201C;Poljak&#x201D; before 1990, [<xref ref-type="bibr" rid="ref-171">171</xref>], p. 481, and sometimes as &#x201C;Polyack&#x201D;, [<xref ref-type="bibr" rid="ref-169">169</xref>]. See also [<xref ref-type="bibr" rid="ref-181">181</xref>].</p></fn></p> 
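<p>The fading-memory sum in Eq. (<xref ref-type="disp-formula" rid="eqn-145">145</xref>) can be checked numerically. The following Python sketch is illustrative only and is not taken from the cited references; it assumes scalar gradients, a recursive update in which each step equals the momentum coefficient times the previous step minus the step length times the current gradient, and hypothetical function names and values for the step length and the momentum coefficient.</p>
<preformat preformat-type="code"><![CDATA[
```python
def momentum_steps(grads, eps, zeta):
    """Recursive form (assumed): step_k = zeta * step_{k-1} - eps * g_k, step_{-1} = 0."""
    steps, prev = [], 0.0
    for g in grads:
        prev = zeta * prev - eps * g
        steps.append(prev)
    return steps


def fading_memory_steps(grads, eps, zeta):
    """Explicit sum as in Eq. (145): step_k = -eps * sum_{i=0..k} zeta**i * g_{k-i}."""
    return [
        -eps * sum(zeta**i * grads[k - i] for i in range(k + 1))
        for k in range(len(grads))
    ]


# Hypothetical values chosen only for illustration.
eps, zeta = 0.1, 0.9
same_direction = [1.0, 1.0, 1.0, 1.0]   # gradients along the floor of the ravine
alternating = [1.0, -1.0, 1.0, -1.0]    # gradients across the ravine

for name, grads in [("same direction", same_direction), ("alternating", alternating)]:
    recursive = momentum_steps(grads, eps, zeta)
    summed = fading_memory_steps(grads, eps, zeta)
    print(name, [round(s, 4) for s in recursive], [round(s, 4) for s in summed])
```
]]></preformat>
<p>In this sketch, both forms give the same steps; gradients in the same direction produce steps of growing magnitude, whereas alternating gradients largely cancel, in line with the two cases described after Eq. (<xref ref-type="disp-formula" rid="eqn-145">145</xref>).</p>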
<statement id="st6_6"><title>Remark 6.6.</title>
<p><italic>Small heavy sphere, or heavy point mass, is a better name</italic>. Because the rotatory motion is not considered in Eq. (<xref ref-type="disp-formula" rid="eqn-142">142</xref>), the name &#x201C;small heavy sphere&#x201D; given in [<xref ref-type="bibr" rid="ref-3">3</xref>] is more precise than the more colloquial name &#x201C;heavy ball&#x201D; often given to the SGD with classical momentum,<xref ref-type="fn" rid="fn154"><sup>154</sup></xref><fn id="fn154"><label>154</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-171">171</xref>], p. 104, [<xref ref-type="bibr" rid="ref-180">180</xref>], p. 115, [<xref ref-type="bibr" rid="ref-169">169</xref>], [<xref ref-type="bibr" rid="ref-181">181</xref>].</p></fn> since &#x201C;small&#x201D; implies that rotatory motion was neglected, and a &#x201C;heavy ball&#x201D; could be as big as a bowling ball<xref ref-type="fn" rid="fn155"><sup>155</sup></xref><fn id="fn155"><label>155</label><p>Or the &#x201C;Times Square Ball&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Times_Square_Ball&amp;oldid=932959767">version 05:17, 29 December 2019</ext-link>.</p></fn> for which rotatory motion cannot be neglected. For this reason, &#x201C;heavy point mass&#x201D; would be a precise alternative name.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement> 
<statement id="st6_7"><title>Remark 6.7.</title>
<p>For Nesterov&#x2019;s fast (accelerated) gradient method, many references referred to [<xref ref-type="bibr" rid="ref-50">50</xref>].<xref ref-type="fn" rid="fn156"><sup>156</sup></xref><fn id="fn156"><label>156</label><p>Reference [<xref ref-type="bibr" rid="ref-50">50</xref>] cannot be found in the Web of Science as of 2020.03.18, perhaps because it was in Russian, as indicated in Ref. [<xref ref-type="bibr" rid="ref-35">35</xref>] in [<xref ref-type="bibr" rid="ref-51">51</xref>], p. 582, where Nesterov&#x2019;s 2004 monograph was Ref. [<xref ref-type="bibr" rid="ref-39">39</xref>].</p></fn> The authors of [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 291, also referred to Nesterov&#x2019;s 2004 monograph, which was mentioned in the Preface of, and the material of which was included in, [<xref ref-type="bibr" rid="ref-51">51</xref>]. For a special class of strongly convex functions,<xref ref-type="fn" rid="fn157"><sup>157</sup></xref><fn id="fn157"><label>157</label><p>A function <inline-formula id="ieqn-3169"><mml:math id="mml-ieqn-3169"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is strongly convex if there is a constant <inline-formula id="ieqn-3170"><mml:math id="mml-ieqn-3170"><mml:mi>&#x03BC;</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> such that for any two points <inline-formula id="ieqn-3171"><mml:math id="mml-ieqn-3171"><mml:mi>x</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-3172"><mml:math id="mml-ieqn-3172"><mml:mi>y</mml:mi></mml:math></inline-formula>, we have <inline-formula id="ieqn-3173"><mml:math id="mml-ieqn-3173"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2265;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo 
fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>x</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mi>&#x03BC;</mml:mi><mml:mo>&#x2225;</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>x</mml:mi><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, where <inline-formula id="ieqn-3174"><mml:math id="mml-ieqn-3174"><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula> is the inner (or dot) product, [<xref ref-type="bibr" rid="ref-51">51</xref>], p. 74.</p></fn> the step length can be kept constant, while the coefficients in Nesterov&#x2019;s fast gradient method varied, to achieve optimal performance, [<xref ref-type="bibr" rid="ref-51">51</xref>], p. 92. &#x201C;Unfortunately, in the stochastic gradient case, Nesterov momentum does not improve the rate of convergence&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 292.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement></sec>
<sec id="s6_3_3"><label>6.3.3</label>
<title>Initial-step-length tuning</title>
<p>The initial step length <inline-formula id="ieqn-958"><mml:math id="mml-ieqn-958"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>, or learning-rate initial value, is one of the two most influential hyperparameters to tune, i.e., to find the best performing values. During tuning, the step length <inline-formula id="ieqn-959"><mml:math id="mml-ieqn-959"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> is kept constant at <inline-formula id="ieqn-960"><mml:math id="mml-ieqn-960"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> in the parameter update Eq. (<xref ref-type="disp-formula" rid="eqn-120">120</xref>) throughout the optimization process, i.e., a fixed step length is used, without decay as in <xref ref-type="disp-formula" rid="eqn-147">Eqs. (147</xref>-<xref ref-type="disp-formula" rid="eqn-150">150)</xref> in Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref>.</p>
<p>The following simple tuning method was proposed in [<xref ref-type="bibr" rid="ref-55">55</xref>]:
<disp-quote><p>&#x201C;To tune the step sizes, we evaluated a logarithmically-spaced grid of five step sizes. If the best performance was ever at one of the extremes of the grid, we would try new grid points so that the best performance was contained in the middle of the parameters. For example, if we initially tried step sizes 2, 1, 0.5, 0.25, and 0.125 and found that 2 was the best performing, we would have tried the step size 4 to see if performance was improved. If performance improved, we would have tried 8 and so on.&#x201D;</p></disp-quote></p>
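<p>The tuning procedure quoted above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in [<xref ref-type="bibr" rid="ref-55">55</xref>]: the function names <monospace>tune_step_size</monospace> and <monospace>toy_loss</monospace>, the factor of 2 between grid points, the cap on grid extensions, and the toy loss (which stands in for a full training run) are all hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def tune_step_size(evaluate, grid=(2.0, 1.0, 0.5, 0.25, 0.125),
                   factor=2.0, max_extensions=20):
    """Search a logarithmically spaced grid of step sizes.

    `evaluate(step)` returns a score for which lower is better (for example,
    a final training loss). While the best step size sits at an extreme of
    the grid, one more grid point is added beyond that extreme.
    """
    scores = {step: evaluate(step) for step in grid}
    for _ in range(max_extensions):
        best = min(scores, key=scores.get)
        if best == max(scores):        # best at the largest step tried so far
            new_step = best * factor
        elif best == min(scores):      # best at the smallest step tried so far
            new_step = best / factor
        else:                          # best lies strictly inside the grid
            return best, scores
        scores[new_step] = evaluate(new_step)
    return min(scores, key=scores.get), scores


def toy_loss(step):
    """Hypothetical stand-in for a training run; its minimum is at step = 8."""
    return (math.log2(step) - 3.0) ** 2


best_step, tried = tune_step_size(toy_loss)
print(best_step, sorted(tried))
```
]]></preformat>
<p>With this toy loss, the search starts with 2 as the best of the five initial values and then tries 4, 8 and 16, mirroring the example in the quotation; in practice, each call to <monospace>evaluate</monospace> would be a complete training run with a fixed step length.</p>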
<p>The above logarithmically-spaced grid was given by <inline-formula id="ieqn-961"><mml:math id="mml-ieqn-961"><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mi>k</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>, with <inline-formula id="ieqn-962"><mml:math id="mml-ieqn-962"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula>. This tuning method appears effective as shown in Figure <xref ref-type="fig" rid="fig-73">73</xref> on the CIFAR-10 dataset mentioned above, for which the following values for <inline-formula id="ieqn-963"><mml:math id="mml-ieqn-963"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> had been tried for different optimizers, even though the values did not always belong to the sequence <inline-formula id="ieqn-964"><mml:math id="mml-ieqn-964"><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, but could include close, rounded values:
<list list-type="bullet">
<list-item><p><xref ref-type="sec" rid="s6_3_1">SGD</xref> (Section <xref ref-type="sec" rid="s6_3_1">6.3.1</xref>): 2, 1, 0.5 (best), 0.25, 0.05, 0.01<xref ref-type="fn" rid="fn158"><sup>158</sup></xref><fn id="fn158"><label>158</label><p>The last two values <inline-formula id="ieqn-3175"><mml:math id="mml-ieqn-3175"><mml:mn>0.05</mml:mn><mml:mo>,</mml:mo><mml:mn>0.01</mml:mn></mml:math></inline-formula> did not belong to the sequence <inline-formula id="ieqn-3176"><mml:math id="mml-ieqn-3176"><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mn>2</mml:mn><mml:mi>k</mml:mi></mml:msup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, with <inline-formula id="ieqn-3177"><mml:math id="mml-ieqn-3177"><mml:mi>k</mml:mi></mml:math></inline-formula> being integers, since <inline-formula id="ieqn-3178"><mml:math id="mml-ieqn-3178"><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mn>0.125</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-3179"><mml:math id="mml-ieqn-3179"><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mn>0.0625</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-3180"><mml:math id="mml-ieqn-3180"><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>5</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mn>0.03125</mml:mn></mml:math></inline-formula>.</p></fn></p></list-item>
<list-item><p><xref ref-type="sec" rid="s6_3_2">SGD with momentum</xref> (Section <xref ref-type="sec" rid="s6_3_2">6.3.2</xref>): 2, 1, 0.5 (best), 0.25, 0.05, 0.01</p></list-item>
<list-item><p><xref ref-type="sec" rid="s6_5_2">AdaGrad</xref> (Section <xref ref-type="sec" rid="s6_5">6.5</xref>): 0.1, 0.05, 0.01 (best, default), 0.0075, 0.005</p></list-item>
<list-item><p><xref ref-type="sec" rid="s6_5_4">RMSProp</xref> (Section <xref ref-type="sec" rid="s6_5">6.5</xref>): 0.005, 0.001, 0.0005, 0.0003 (best), 0.0001</p></list-item>
<list-item><p><xref ref-type="sec" rid="s6_5_6">Adam</xref> (Section <xref ref-type="sec" rid="s6_5">6.5</xref>): 0.005, 0.001 (default), 0.0005, 0.0003 (best), 0.0001, 0.00005</p></list-item></list></p></sec>
<sec id="s6_3_4"><label>6.3.4</label>
<title>Step-length decay, annealing and cyclic annealing</title>
<p>In the update of the parameter <inline-formula id="ieqn-965"><mml:math id="mml-ieqn-965"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> as in Eq. (<xref ref-type="disp-formula" rid="eqn-120">120</xref>), the learning rate (step length) <inline-formula id="ieqn-966"><mml:math id="mml-ieqn-966"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> has to be reduced gradually as a function of either the epoch counter <inline-formula id="ieqn-967"><mml:math id="mml-ieqn-967"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> or of the global iteration counter <inline-formula id="ieqn-968"><mml:math id="mml-ieqn-968"><mml:mi>j</mml:mi></mml:math></inline-formula>. Let <inline-formula id="ieqn-969"><mml:math id="mml-ieqn-969"><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow></mml:math></inline-formula> represent either <inline-formula id="ieqn-970"><mml:math id="mml-ieqn-970"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> or <inline-formula id="ieqn-971"><mml:math id="mml-ieqn-971"><mml:mi>j</mml:mi></mml:math></inline-formula>, depending on the user&#x2019;s choice.<xref ref-type="fn" rid="fn159"><sup>159</sup></xref><fn id="fn159"><label>159</label><p>The Avant Garde font &#x2020; is used to avoid confusion with <inline-formula id="ieqn-3182"><mml:math id="mml-ieqn-3182"><mml:mi>t</mml:mi></mml:math></inline-formula>, the time variable used in relation to recurrent neural networks; see Section <xref ref-type="sec" rid="s7">7</xref> on &#x201C;Dynamics, sequential data, sequence modeling&#x201D;, and Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref> on &#x201C;Dynamics, time dependence, Volterra series&#x201D;. 
Many papers on deep-learning optimizers used <inline-formula id="ieqn-3183"><mml:math id="mml-ieqn-3183"><mml:mi>t</mml:mi></mml:math></inline-formula> as global iteration counter, which is denoted by <inline-formula id="ieqn-3184"><mml:math id="mml-ieqn-3184"><mml:mi>j</mml:mi></mml:math></inline-formula> here; see, e.g., [<xref ref-type="bibr" rid="ref-182">182</xref>], [<xref ref-type="bibr" rid="ref-56">56</xref>].</p></fn> If the learning rate <inline-formula id="ieqn-972"><mml:math id="mml-ieqn-972"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> is a function of epoch <inline-formula id="ieqn-973"><mml:math id="mml-ieqn-973"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>, then <inline-formula id="ieqn-974"><mml:math id="mml-ieqn-974"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> is held constant in all iterations <inline-formula id="ieqn-975"><mml:math id="mml-ieqn-975"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> within epoch <inline-formula id="ieqn-976"><mml:math id="mml-ieqn-976"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>, and we have the relation:</p>
<p><disp-formula id="eqn-146"><label>(146)</label><mml:math id="mml-eqn-146" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>k</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The following learning-rate scheduling, linear with respect to <inline-formula id="ieqn-977"><mml:math id="mml-ieqn-977"><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow></mml:math></inline-formula>, is one option:<xref ref-type="fn" rid="fn160"><sup>160</sup></xref><fn id="fn160"><label>160</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 287, where it was suggested that &#x2020;<sub><italic>c</italic></sub> in Eq. (<xref ref-type="disp-formula" rid="eqn-147">147</xref>) would be &#x201C;set to the number of iterations required to make a few hundred passes through the training set,&#x201D; and &#x03F5;<sub>&#x2020;<sub><italic>c</italic></sub></sub> &#x201C;should be set to roughly 1 percent the value of <inline-formula id="ieqn-3187"><mml:math id="mml-ieqn-3187"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>&#x201D;. A &#x201C;few hundred passes through the training set&#x201D; means a few hundred epochs; see Footnote <xref ref-type="fn" rid="fn161">161</xref>. In [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 286, Algorithm 1 SGD for &#x201C;training iteration <inline-formula id="ieqn-3188"><mml:math id="mml-ieqn-3188"><mml:mi>k</mml:mi></mml:math></inline-formula>&#x201D; should mean for &#x201C;training epoch <inline-formula id="ieqn-3189"><mml:math id="mml-ieqn-3189"><mml:mi>k</mml:mi></mml:math></inline-formula>&#x201D;, and the learning rate &#x201C;<inline-formula id="ieqn-3190"><mml:math id="mml-ieqn-3190"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>&#x201D; would be held constant within &#x201C;epoch <inline-formula id="ieqn-3191"><mml:math id="mml-ieqn-3191"><mml:mi>k</mml:mi></mml:math></inline-formula>&#x201D;.</p></fn></p>
<p><disp-formula id="eqn-147"><label>(147)</label><mml:math id="mml-eqn-147" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mfrac><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mstyle></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#x00A0;for&#x00A0;</mml:mtext></mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#x00A0;for&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mi>c</mml:mi
></mml:msub><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-148"><label>(148)</label><mml:math id="mml-eqn-148" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>epoch&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mrow><mml:mtext>&#x00A0;or global iteration&#x00A0;</mml:mtext></mml:mrow><mml:mi>j</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-978"><mml:math id="mml-ieqn-978"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> is the learning-rate initial value, and <inline-formula id="ieqn-979"><mml:math id="mml-ieqn-979"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula> the constant learning-rate value when <inline-formula id="ieqn-980"><mml:math id="mml-ieqn-980"><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo>&#x2265;</mml:mo><mml:msub><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula>. Other possible learning-rate schedules are:<xref ref-type="fn" rid="fn161"><sup>161</sup></xref><fn id="fn161"><label>161</label><p>See [<xref ref-type="bibr" rid="ref-182">182</xref>], p. 3, below Algorithm 1 and just below the equation labeled &#x201C;(Sgd)&#x201D;. After, say, 400 global iterations, i.e., &#x2020; = <italic>j</italic> = 400, then &#x03F5;<sub>400</sub> = 5&#x0025; &#x03F5;<sub>0</sub> according to Eq. (<xref ref-type="disp-formula" rid="eqn-149">149</xref>), and <inline-formula id="ieqn-3194"><mml:math id="mml-ieqn-3194"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mn>400</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0.25</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> according to Eq. 
(<xref ref-type="disp-formula" rid="eqn-150">150</xref>), whereas <inline-formula id="ieqn-3195"><mml:math id="mml-ieqn-3195"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mn>400</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> according to Eq. (<xref ref-type="disp-formula" rid="eqn-147">147</xref>). See Footnote <xref ref-type="fn" rid="fn160">160</xref>, and also Figure <xref ref-type="fig" rid="fig-151">151</xref> in Section <xref ref-type="sec" rid="s14_7">14.7</xref> on &#x201C;Lack of transparency and irreproducibility of results&#x201D; in recent deep-learning papers.</p></fn></p>
<p><disp-formula id="eqn-149"><label>(149)</label><mml:math id="mml-eqn-149" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:msqrt><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow></mml:msqrt></mml:mfrac><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mn>0</mml:mn><mml:mrow><mml:mtext>&#x00A0;as&#x00A0;</mml:mtext></mml:mrow><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-150"><label>(150)</label><mml:math id="mml-eqn-150" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow></mml:mfrac><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mn>0</mml:mn><mml:mrow><mml:mtext>&#x00A0;as&#x00A0;</mml:mtext></mml:mrow><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-981"><mml:math id="mml-ieqn-981"><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow></mml:math></inline-formula> defined as in Eq. (<xref ref-type="disp-formula" rid="eqn-148">148</xref>), even though authors such as [<xref ref-type="bibr" rid="ref-182">182</xref>] and [<xref ref-type="bibr" rid="ref-183">183</xref>] used Eq. (<xref ref-type="disp-formula" rid="eqn-149">149</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-150">150</xref>) with <inline-formula id="ieqn-982"><mml:math id="mml-ieqn-982"><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mi>j</mml:mi></mml:math></inline-formula> as global iteration counter.</p>
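<p>The counter relation in Eq. (<xref ref-type="disp-formula" rid="eqn-146">146</xref>) and the schedules in Eqs. (<xref ref-type="disp-formula" rid="eqn-147">147</xref>), (<xref ref-type="disp-formula" rid="eqn-149">149</xref>) and (<xref ref-type="disp-formula" rid="eqn-150">150</xref>) can be written as short Python functions, as in the sketch below. The function and argument names are hypothetical, the counter is assumed to start at 1 for the two inverse schedules, and the linear schedule is evaluated under the assumption that the cutoff is at 400 iterations with a final value of 1 percent of the initial learning rate.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def global_iteration(tau, k, k_max):
    """Eq. (146): global counter j from epoch tau (1-based) and in-epoch iteration k."""
    return (tau - 1) * k_max + k


def linear_schedule(t, eps0, eps_c, t_c):
    """Eq. (147): linear decay from eps0 to eps_c over [0, t_c], constant afterwards."""
    if t >= t_c:
        return eps_c
    return (1.0 - t / t_c) * eps0 + (t / t_c) * eps_c


def inv_sqrt_schedule(t, eps0):
    """Eq. (149): eps0 / sqrt(t), assuming the counter t starts at 1."""
    return eps0 / math.sqrt(t)


def inv_schedule(t, eps0):
    """Eq. (150): eps0 / t, assuming the counter t starts at 1."""
    return eps0 / t


eps0 = 1.0                      # learning rates below are fractions of eps0
j = global_iteration(tau=4, k=100, k_max=100)   # last iteration of epoch 4
print(j)
print(linear_schedule(j, eps0, eps_c=0.01 * eps0, t_c=400))
print(inv_sqrt_schedule(j, eps0))
print(inv_schedule(j, eps0))
```
]]></preformat>
<p>Under these assumptions, the three schedules at 400 global iterations give 1 percent, 5 percent and 0.25 percent of the initial learning rate, respectively, which can be compared with the values quoted in Footnote <xref ref-type="fn" rid="fn161">161</xref>.</p>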
<p>Another step-length decay method proposed in [<xref ref-type="bibr" rid="ref-55">55</xref>] is to reduce the step length <inline-formula id="ieqn-983"><mml:math id="mml-ieqn-983"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for the current epoch <inline-formula id="ieqn-984"><mml:math id="mml-ieqn-984"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> by a factor <inline-formula id="ieqn-985"><mml:math id="mml-ieqn-985"><mml:mi>&#x03D6;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> when the cost estimate <inline-formula id="ieqn-986"><mml:math id="mml-ieqn-986"><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> at the end of the last epoch <inline-formula id="ieqn-987"><mml:math id="mml-ieqn-987"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is greater than the lowest cost in all previous global iterations, with <inline-formula id="ieqn-988"><mml:math id="mml-ieqn-988"><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> denoting the cost estimate at global iteration <inline-formula id="ieqn-989"><mml:math id="mml-ieqn-989"><mml:mi>j</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-990"><mml:math id="mml-ieqn-990"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo 
stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the global iteration number at the end of epoch <inline-formula id="ieqn-991"><mml:math id="mml-ieqn-991"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-151"><label>(151)</label><mml:math id="mml-eqn-151" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mi>&#x03D6;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#x00A0;if&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x003E;</mml:mo><mml:munder><mml:mo form="prefix">min</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mtext>&#x00A0;Otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Recall, <inline-formula id="ieqn-992"><mml:math id="mml-ieqn-992"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the number of non-overlapping minibatches that cover the training set, as defined in Eq. (<xref ref-type="disp-formula" rid="eqn-135">135</xref>). [<xref ref-type="bibr" rid="ref-55">55</xref>] set the step-length decay parameter <inline-formula id="ieqn-993"><mml:math id="mml-ieqn-993"><mml:mi>&#x03D6;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.9</mml:mn></mml:math></inline-formula> in their numerical examples, in particular Figure <xref ref-type="fig" rid="fig-73">73</xref>.</p>
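<p>The decay rule in Eq. (<xref ref-type="disp-formula" rid="eqn-151">151</xref>) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in [<xref ref-type="bibr" rid="ref-55">55</xref>]: the function name <monospace>decay_step_length</monospace> and the argument <monospace>cost_history</monospace> are hypothetical, and the sketch assumes that the cost compared against the minimum is the last cost estimate recorded at the end of the previous epoch.</p>

```python
def decay_step_length(eps_prev, cost_history, varpi=0.9):
    """Step length for the next epoch, following the rule of Eq. (151).

    eps_prev     : step length used during the epoch that just ended
    cost_history : cost estimates for all global iterations up to the end
                   of that epoch (assumption: the last entry is the
                   end-of-epoch cost estimate)
    varpi        : step-length decay parameter, between 0 and 1
    """
    if not cost_history:
        raise ValueError("cost_history must contain at least one estimate")
    end_of_epoch_cost = cost_history[-1]
    # Decay only when the end-of-epoch cost is above the lowest cost seen so far.
    if end_of_epoch_cost > min(cost_history):
        return varpi * eps_prev
    return eps_prev


# Hypothetical cost histories, for illustration only.
print(decay_step_length(0.1, [3.0, 2.0, 2.5]))  # cost went back up: decay
print(decay_step_length(0.1, [3.0, 2.5, 2.0]))  # new minimum reached: keep
```

<p>With the default value 0.9 for the decay parameter, the step length shrinks by ten percent each time an epoch ends without reaching a new lowest cost estimate, and is left unchanged otherwise.</p>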
<p><bold>Cyclic annealing.</bold> In addition to decaying the step length <inline-formula id="ieqn-994"><mml:math id="mml-ieqn-994"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula>, which is already a form of annealing, cyclic annealing is introduced to further reduce the step length down to zero (&#x201C;cooling&#x201D;), more quickly than decaying, then to bring the step length back up rapidly (&#x201C;heating&#x201D;), and to do so for several cycles. The cosine function is typically used, as shown in Figure <xref ref-type="fig" rid="fig-75">75</xref>, as a multiplicative factor <inline-formula id="ieqn-995"><mml:math id="mml-ieqn-995"><mml:msub><mml:mrow><mml:mi>&#x1D51E;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> to the step length <inline-formula id="ieqn-996"><mml:math id="mml-ieqn-996"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in the parameter update, hence the name &#x201C;cosine annealing&#x201D;:</p>
<p><disp-formula id="eqn-152"><label>(152)</label><mml:math id="mml-eqn-152" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">a</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>as an add-on to the parameter update for vanilla SGD Eq. (<xref ref-type="disp-formula" rid="eqn-120">120</xref>), or</p>
<p><disp-formula id="eqn-153"><label>(153)</label><mml:math id="mml-eqn-153" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">a</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>as an add-on to the parameter update for SGD with momentum and accelerated gradient Eq. (<xref ref-type="disp-formula" rid="eqn-141">141</xref>). The cosine annealing factor can take the form [<xref ref-type="bibr" rid="ref-56">56</xref>]:</p>
<p><disp-formula id="eqn-154"><label>(154)</label><mml:math id="mml-eqn-154" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">a</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn><mml:mo>+</mml:mo><mml:mn>0.5</mml:mn><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C0;</mml:mi><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>:=</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msub><mml:mi>T</mml:mi><mml:mi>q</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-997"><mml:math id="mml-ieqn-997"><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the number of epochs from the start of the last warm restart at the end of epoch <inline-formula id="ieqn-998"><mml:math id="mml-ieqn-998"><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:msub><mml:mi>T</mml:mi><mml:mi>q</mml:mi></mml:msub></mml:math></inline-formula>, where <inline-formula id="ieqn-999"><mml:math id="mml-ieqn-999"><mml:msub><mml:mrow><mml:mi>&#x1D51E;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> (&#x201C;maximum heating&#x201D;), <inline-formula id="ieqn-1000"><mml:math id="mml-ieqn-1000"><mml:mi>j</mml:mi></mml:math></inline-formula> the current global iteration counter, <inline-formula id="ieqn-1001"><mml:math id="mml-ieqn-1001"><mml:msub><mml:mi>T</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:math></inline-formula> the maximum number of epochs allowed for the current <inline-formula id="ieqn-1002"><mml:math id="mml-ieqn-1002"><mml:mi>p</mml:mi></mml:math></inline-formula>th annealing cycle, during which <inline-formula id="ieqn-1003"><mml:math id="mml-ieqn-1003"><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> would go from <inline-formula id="ieqn-1004"><mml:math id="mml-ieqn-1004"><mml:mn>0</mml:mn></mml:math></inline-formula> to <inline-formula id="ieqn-1005"><mml:math id="mml-ieqn-1005"><mml:msub><mml:mi>T</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:math></inline-formula>, when <inline-formula id="ieqn-1006"><mml:math 
id="mml-ieqn-1006"><mml:msub><mml:mrow><mml:mi>&#x1D51E;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> (&#x201C;complete cooling&#x201D;). Figure <xref ref-type="fig" rid="fig-75">75</xref> shows four annealing cycles, which helped dramatically reduce the number of epochs needed to reach the same low cost as that obtained without annealing.</p>
<p>Figure <xref ref-type="fig" rid="fig-74">74</xref> shows the effectiveness of cosine annealing in bringing down the cost rapidly in the early stage, but the return diminishes, as the cost reduction decreases with the number of annealing cycles. Beyond a certain point, cosine annealing is no longer as effective as SGD with weight decay in Section <xref ref-type="sec" rid="s6_3_6">6.3.6</xref>.</p>
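<p>The cosine annealing factor of Eq. (<xref ref-type="disp-formula" rid="eqn-154">154</xref>) and the annealed vanilla SGD update of Eq. (<xref ref-type="disp-formula" rid="eqn-152">152</xref>) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of [<xref ref-type="bibr" rid="ref-56">56</xref>]: the function names, the list <monospace>cycle_lengths</monospace> holding the cycle lengths, and the example values are hypothetical; the counter and the cycle lengths are assumed to be in the same units and the cycle lengths positive; and a counter falling exactly on a cycle boundary is treated as a warm restart.</p>

```python
import math


def cosine_annealing_factor(j, cycle_lengths):
    """Annealing factor of Eq. (154) at counter j.

    cycle_lengths holds T_1, T_2, ... in the same units as j (epochs here).
    A counter that falls exactly on a cycle boundary is treated as the start
    of the next cycle (warm restart), except at the end of the last cycle.
    """
    if 0 > j:
        raise ValueError("j must be non-negative")
    start = 0.0  # sum of T_q over the cycles already completed
    last = len(cycle_lengths) - 1
    for p, T_p in enumerate(cycle_lengths):
        T_cur = j - start  # time elapsed since the last warm restart
        if T_p > T_cur or (p == last and T_cur == T_p):
            return 0.5 + 0.5 * math.cos(math.pi * T_cur / T_p)
        start += T_p
    raise ValueError("j lies beyond the last annealing cycle")


def annealed_sgd_step(theta, grad_estimate, eps, factor):
    """Vanilla SGD update scaled by the annealing factor, as in Eq. (152)."""
    return [t - factor * eps * g for t, g in zip(theta, grad_estimate)]


cycles = [10, 10, 20]  # hypothetical cycle lengths T_1, T_2, T_3
for j in (0, 5, 10, 25, 40):
    print(j, round(cosine_annealing_factor(j, cycles), 4))
```

<p>The factor equals one at each warm restart (maximum heating) and falls to zero at the end of the last cycle (complete cooling); multiplying it into the step length leaves the rest of the update unchanged.</p>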
<p><bold>Convergence conditions.</bold> The sufficient conditions for convergence, for convex functions, are<xref ref-type="fn" rid="fn162"><sup>162</sup></xref><fn id="fn162"><label>162</label><p>Eq. (<xref ref-type="disp-formula" rid="eqn-155">155</xref>) are called the &#x201C;stepsize requirements&#x201D; in [<xref ref-type="bibr" rid="ref-80">80</xref>], and &#x201C;sufficient condition for convergence&#x201D; in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 287, and in [<xref ref-type="bibr" rid="ref-184">184</xref>]. Robbins &amp; Monro (1951b) [<xref ref-type="bibr" rid="ref-49">49</xref>] were concerned with solving <inline-formula id="ieqn-3196"><mml:math id="mml-ieqn-3196"><mml:mi>M</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> when the function <inline-formula id="ieqn-3197"><mml:math id="mml-ieqn-3197"><mml:mi>M</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is not known, but the distribution of the output, as a random variable, <inline-formula id="ieqn-3198"><mml:math id="mml-ieqn-3198"><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is assumed known. 
For the network training problem at hand, one can think of <inline-formula id="ieqn-3199"><mml:math id="mml-ieqn-3199"><mml:mi>M</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=&#x2225;</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2225;</mml:mo></mml:math></inline-formula>, i.e., the magnitude of the gradient of the cost function <inline-formula id="ieqn-3200"><mml:math id="mml-ieqn-3200"><mml:mi>J</mml:mi></mml:math></inline-formula> at <inline-formula id="ieqn-3201"><mml:math id="mml-ieqn-3201"><mml:mi>x</mml:mi><mml:mo>=&#x2225;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">&#x2225;</mml:mo></mml:math></inline-formula>, the distance from a local minimizer, and <inline-formula id="ieqn-3202"><mml:math id="mml-ieqn-3202"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, i.e., the stationarity point of <inline-formula id="ieqn-3203"><mml:math id="mml-ieqn-3203"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. In [<xref ref-type="bibr" rid="ref-49">49</xref>]&#x2013;in which there was no notion of &#x201C;epoch <inline-formula id="ieqn-3204"><mml:math id="mml-ieqn-3204"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>&#x201D; but only global iteration counter <inline-formula id="ieqn-3205"><mml:math id="mml-ieqn-3205"><mml:mi>j</mml:mi></mml:math></inline-formula>&#x2013;Eq. (<xref ref-type="disp-formula" rid="eqn-6">6</xref>) on p. 401 corresponds to Eq. 
(<xref ref-type="disp-formula" rid="eqn-155">155</xref>)<inline-formula id="ieqn-3206"><mml:math id="mml-ieqn-3206"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> (first part), and Eq. (<xref ref-type="disp-formula" rid="eqn-27">27</xref>) on p. 404 corresponds to Eq. (<xref ref-type="disp-formula" rid="eqn-155">155</xref>)<inline-formula id="ieqn-3207"><mml:math id="mml-ieqn-3207"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> (second part). Any sequence <inline-formula id="ieqn-3208"></inline-formula> that satisfied Eq. (<xref ref-type="disp-formula" rid="eqn-155">155</xref>) was called a sequence of type <inline-formula id="ieqn-3209"><mml:math id="mml-ieqn-3209"><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>k</mml:mi></mml:math></inline-formula>; the convergence Theorem 1 on p. 404 and Theorem 2 on p. 405 indicated that the sequence of step length <inline-formula id="ieqn-3210"></inline-formula> being of type <inline-formula id="ieqn-3211"><mml:math id="mml-ieqn-3211"><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>k</mml:mi></mml:math></inline-formula> was only one among other sufficient conditions for convergence. In Theorem 2 of [<xref ref-type="bibr" rid="ref-49">49</xref>], the additional sufficient conditions were Eq. (<xref ref-type="disp-formula" rid="eqn-33">33</xref>), <inline-formula id="ieqn-3212"><mml:math id="mml-ieqn-3212"><mml:mi>M</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=&#x2225;</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2225;</mml:mo></mml:math></inline-formula> non decreasing, Eq. 
(<xref ref-type="disp-formula" rid="eqn-34">34</xref>), <inline-formula id="ieqn-3213"><mml:math id="mml-ieqn-3213"><mml:mi>M</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=&#x2225;</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2225;=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, and Eq. (<xref ref-type="disp-formula" rid="eqn-35">35</xref>), <inline-formula id="ieqn-3214"><mml:math id="mml-ieqn-3214"><mml:mrow><mml:mi>M</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2225;</mml:mo><mml:msup><mml:mo>&#x2207;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mi>J</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mtext>&#x003E;&#x00A0;</mml:mtext><mml:mn>0</mml:mn></mml:mrow></mml:math></inline-formula>, i.e., the iterates <inline-formula id="ieqn-3215"><mml:math id="mml-ieqn-3215"><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x007C;</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>&#x2026;</mml:mn></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, fell into a local convex bowl.</p></fn></p>
<p><disp-formula id="eqn-155"><label>(155)</label><mml:math id="mml-eqn-155" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>&#x003C;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The inequality on the left of Eq. (<xref ref-type="disp-formula" rid="eqn-155">155</xref>), i.e., the sum of the squares of the step lengths being finite, ensures that the step length decays quickly enough to reach the minimum, but is valid only when the minibatch size is fixed. The equation on the right of Eq. (<xref ref-type="disp-formula" rid="eqn-155">155</xref>) ensures convergence, no matter how far the initial guess is from the minimum [<xref ref-type="bibr" rid="ref-164">164</xref>].</p>
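<p>As a numerical illustration of Eq. (<xref ref-type="disp-formula" rid="eqn-155">155</xref>), the sketch below accumulates partial sums for a step-length sequence of type 1/k, here taken as the initial step length divided by the iteration counter. The function name <monospace>partial_sums</monospace> and the chosen values are hypothetical; the code only suggests the trend and proves nothing. The sum of the step lengths keeps growing, roughly like a logarithm, while the sum of the squared step lengths stays below a finite bound.</p>

```python
import math


def partial_sums(eps0, n_terms):
    """Partial sums of eps_j and eps_j**2 for the schedule eps_j = eps0 / j."""
    total = 0.0
    total_sq = 0.0
    for j in range(1, n_terms + 1):
        eps_j = eps0 / j
        total += eps_j
        total_sq += eps_j ** 2
    return total, total_sq


# For eps0 = 1, the squared sum is bounded by pi**2 / 6 (Basel problem),
# whereas the plain sum exceeds log(n) and grows without bound.
bound = math.pi ** 2 / 6
for n in (10, 1000, 100000):
    s, s2 = partial_sums(1.0, n)
    print(n, round(s, 4), round(s2, 6), round(bound, 6))
```

<p>A constant step length, in contrast, would satisfy the condition on the right of Eq. (<xref ref-type="disp-formula" rid="eqn-155">155</xref>) but violate the one on the left, since the sum of its squares diverges.</p>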
<p>In Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>, the step-length decay is shown to be equivalent to minibatch-size increase and simulated annealing in the sense that there would be less fluctuation, and thus lower &#x201C;temperature&#x201D; (cooling) by analogy to the physics governed by the Langevin stochastic differential equation and its discrete version, which is analogous to the network parameter update.</p>
</sec>
<sec id="s6_3_5"><label>6.3.5</label>
<title>Minibatch-size increase, fixed step length, equivalent annealing</title>
<p>The minibatch parameter update from Eq. (<xref ref-type="disp-formula" rid="eqn-141">141</xref>), without momentum and accelerated gradient, which becomes Eq. (<xref ref-type="disp-formula" rid="eqn-120">120</xref>), can be rewritten to introduce the error due to the use of the minibatch gradient estimate <inline-formula id="ieqn-1007"><mml:math id="mml-ieqn-1007"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> instead of the full-batch gradient <inline-formula id="ieqn-1008"><mml:math id="mml-ieqn-1008"><mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow></mml:math></inline-formula> as follows:</p>
<p><disp-formula id="eqn-156"><label>(156)</label><mml:math id="mml-eqn-156" display="block"><mml:mrow><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi>g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi>g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x21D2;</mml:mo><mml:mfrac><mml:mrow><mml:mtext>&#x0394;&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi>g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant='fraktur'>e</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mtext>&#x00A0;&#x00A0;</mml:mtext></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1009"><mml:math id="mml-ieqn-1009"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-1010"><mml:math id="mml-ieqn-1010"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-1011"><mml:math id="mml-ieqn-1011"><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-1012"><mml:math id="mml-ieqn-1012"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>b</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the gradient estimate function using minibatch <inline-formula id="ieqn-1013"><mml:math id="mml-ieqn-1013"><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
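<p>The rewriting in Eq. (<xref ref-type="disp-formula" rid="eqn-156">156</xref>) can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the scalar least-squares cost, the data values, and all function names are hypothetical and do not come from the references. It computes the full-batch gradient, the minibatch gradient estimate, and the gradient error, then compares the parameter change per unit step length with the negative full-batch gradient plus the error.</p>

```python
def example_gradient(theta, x, y):
    """Gradient of the toy per-example cost 0.5 * (theta * x - y)**2."""
    return (theta * x - y) * x


def mean_gradient(theta, samples):
    """Average gradient over a collection of (x, y) examples."""
    return sum(example_gradient(theta, x, y) for x, y in samples) / len(samples)


def sgd_step_with_error(theta, eps, training_set, minibatch):
    """One vanilla SGD step, returning the quantities that appear in Eq. (156)."""
    g_full = mean_gradient(theta, training_set)  # full-batch gradient
    g_mini = mean_gradient(theta, minibatch)     # minibatch gradient estimate
    error = g_full - g_mini                      # gradient error
    theta_next = theta - eps * g_mini
    return theta_next, g_full, error


# Hypothetical data: four examples split into two non-overlapping minibatches.
training_set = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]
minibatch = training_set[:2]
theta, eps = 0.5, 0.1
theta_next, g_full, error = sgd_step_with_error(theta, eps, training_set, minibatch)
# Both printed numbers should agree up to floating-point round-off.
print((theta_next - theta) / eps, -g_full + error)
```

<p>The same helper can be called with each minibatch in turn to inspect how the gradient error varies from one minibatch to the next at a fixed parameter value.</p>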
<p>To show that the gradient error has zero mean (average), based on the linearity of the expectation function <inline-formula id="ieqn-1014"><mml:math id="mml-ieqn-1014"><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula> defined in Eq. (<xref ref-type="disp-formula" rid="eqn-67">67</xref>) (Footnote <xref ref-type="fn" rid="fn89">89</xref>), i.e.,</p>
<p><disp-formula id="eqn-157"><label>(157)</label><mml:math id="mml-eqn-157" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-158"><label>(158)</label><mml:math id="mml-eqn-158" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>from <xref ref-type="disp-formula" rid="eqn-135">Eqs. (135)</xref>-<xref ref-type="disp-formula" rid="eqn-137">(137)</xref> on the definition of minibatches and <xref ref-type="disp-formula" rid="eqn-139">Eqs. (139)</xref>-<xref ref-type="disp-formula" rid="eqn-140">(140)</xref> on the definition of the cost and gradient estimates (without omitting the iteration counter <inline-formula id="ieqn-1015"><mml:math id="mml-ieqn-1015"><mml:mi>k</mml:mi></mml:math></inline-formula>), we have</p>
<p><disp-formula id="eqn-159"><label>(159)</label><mml:math id="mml-eqn-159" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi 
mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-160"><label>(160)</label><mml:math id="mml-eqn-160" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi 
mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Alternatively, the same result can be obtained from:</p>
<p><disp-formula id="eqn-161"><label>(161)</label><mml:math id="mml-eqn-161" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Next, the computation of the mean value of the &#x201C;square&#x201D; of the gradient error, i.e., <inline-formula id="ieqn-1016"><mml:math id="mml-ieqn-1016"><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula>, in which the iteration counter subscript <inline-formula id="ieqn-1017"><mml:math id="mml-ieqn-1017"><mml:mi>k</mml:mi></mml:math></inline-formula> is omitted to lighten the notation, relies on some identities related to the covariance matrix <inline-formula id="ieqn-1018"><mml:math id="mml-ieqn-1018"><mml:mo>&#x27E8;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula>. 
The mean of the square matrix <inline-formula id="ieqn-1019"><mml:math id="mml-ieqn-1019"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula>, where <inline-formula id="ieqn-1020"><mml:math id="mml-ieqn-1020"><mml:mo>{</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>}</mml:mo></mml:math></inline-formula> are two random <italic>row</italic> matrices, is the sum of the product of their mean values and their covariance matrix:<xref ref-type="fn" rid="fn163"><sup>163</sup></xref><fn id="fn163"><label>163</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-185">185</xref>], p. 36, Eq. (2.8.3).</p></fn></p>
<p><disp-formula id="eqn-162"><label>(162)</label><mml:math id="mml-eqn-162" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>+</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;or&#x00A0;</mml:mtext></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi 
mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1021"><mml:math id="mml-ieqn-1021"><mml:mo>&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula> is the covariance matrix of <inline-formula id="ieqn-1022"><mml:math id="mml-ieqn-1022"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1023"><mml:math id="mml-ieqn-1023"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula>, and thus the covariance operator <inline-formula id="ieqn-1024"><mml:math id="mml-ieqn-1024"><mml:mo>&#x27E8;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula> is bilinear due to the linearity of the mean (expectation) operator <inline-formula id="ieqn-1025"><mml:math id="mml-ieqn-1025"><mml:mo>&#x27E8;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-157">157</xref>):</p>
<p><disp-formula id="eqn-163"><label>(163)</label><mml:math id="mml-eqn-163" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>&#x27E8;</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x27E9;</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi 
mathvariant="bold-italic">v</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msup><mml:mrow><mml:mtext>&#x00A0;random</mml:mtext></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Eq. (<xref ref-type="disp-formula" rid="eqn-163">163</xref>) is the key relation for deriving an expression for the mean square of the gradient error <inline-formula id="ieqn-1026"><mml:math id="mml-ieqn-1026"><mml:mo>&#x27E8;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula>, which can be rewritten as a signed sum of four covariance matrices upon using Eq. (<xref ref-type="disp-formula" rid="eqn-162">162</xref>)<inline-formula id="ieqn-1027"><mml:math id="mml-ieqn-1027"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> together with either Eq. (<xref ref-type="disp-formula" rid="eqn-160">160</xref>) or Eq. (<xref ref-type="disp-formula" rid="eqn-161">161</xref>), i.e., <inline-formula id="ieqn-1028"><mml:math id="mml-ieqn-1028"><mml:mo>&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, since the four terms <inline-formula id="ieqn-1029"><mml:math id="mml-ieqn-1029"><mml:msubsup><mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> then cancel each other out:</p>
<p><disp-formula id="eqn-164"><label>(164)</label><mml:math id="mml-eqn-164" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo fence="false" 
stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>+</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the iteration counter <inline-formula id="ieqn-1030"><mml:math id="mml-ieqn-1030"><mml:mi>k</mml:mi></mml:math></inline-formula> has been omitted to lighten the notation. Moreover, to simplify the notation further, the gradient related to an example is simply denoted by <inline-formula id="ieqn-1031"><mml:math id="mml-ieqn-1031"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> or <inline-formula id="ieqn-1032"><mml:math id="mml-ieqn-1032"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, with <inline-formula id="ieqn-1033"><mml:math id="mml-ieqn-1033"><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> for a minibatch, and <inline-formula id="ieqn-1034"><mml:math id="mml-ieqn-1034"><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula> for the full batch:</p>
<p><disp-formula id="eqn-165"><label>(165)</label><mml:math id="mml-eqn-165" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi 
mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-166"><label>(166)</label><mml:math id="mml-eqn-166" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
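<p>As a sketch of why only covariance terms survive in Eq. (<xref ref-type="disp-formula" rid="eqn-164">164</xref>), one may expand the product and apply Eq. (<xref ref-type="disp-formula" rid="eqn-162">162</xref>) to each of the four resulting means, assuming (as in Eq. (<xref ref-type="disp-formula" rid="eqn-160">160</xref>)) that the minibatch gradient and the full-batch gradient share the same mean <inline-formula><mml:math><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula>:</p>
<p><disp-formula><mml:math display="block"><mml:mtable columnalign="right left" rowspacing="3pt" columnspacing="0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>+</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>+</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>+</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The products of the means enter with the signs +, &#x2212;, &#x2212;, + and thus sum to zero, which leaves the four covariance terms of Eq. (<xref ref-type="disp-formula" rid="eqn-164">164</xref>).</p>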
<p>Now assume the covariance matrix of any pair of single-example gradients <inline-formula id="ieqn-1035"><mml:math id="mml-ieqn-1035"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1036"><mml:math id="mml-ieqn-1036"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> depends only on the parameters <inline-formula id="ieqn-1037"><mml:math id="mml-ieqn-1037"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula>, and is of the form:</p>
<p><disp-formula id="eqn-167"><label>(167)</label><mml:math id="mml-eqn-167" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1038"><mml:math id="mml-ieqn-1038"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the Kronecker delta. Using <xref ref-type="disp-formula" rid="eqn-165">Eqs. (165)</xref>-<xref ref-type="disp-formula" rid="eqn-166">(166)</xref> and Eq. (<xref ref-type="disp-formula" rid="eqn-167">167</xref>) in Eq. (<xref ref-type="disp-formula" rid="eqn-164">164</xref>), we obtain a simple expression for <inline-formula id="ieqn-1039"><mml:math id="mml-ieqn-1039"><mml:mo>&#x27E8;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula>:<xref ref-type="fn" rid="fn164"><sup>164</sup></xref><fn id="fn164"><label>164</label><p>Eq. (<xref ref-type="disp-formula" rid="eqn-168">168</xref>) possesses an elegant simplicity compared to the expression <inline-formula id="ieqn-3216"><mml:math id="mml-ieqn-3216"><mml:mo stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mi>&#x03B1;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mi>F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C9;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-164">164</xref>], based on a different definition of gradient, with <inline-formula id="ieqn-3217"><mml:math id="mml-ieqn-3217"><mml:mi>N</mml:mi><mml:mo>&#x2261;</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-3218"><mml:math 
id="mml-ieqn-3218"><mml:mi>B</mml:mi><mml:mo>&#x2261;</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-3219"><mml:math id="mml-ieqn-3219"><mml:mi>&#x03C9;</mml:mi><mml:mo>&#x2261;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula>, but <inline-formula id="ieqn-3220"><mml:math id="mml-ieqn-3220"><mml:mi>F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C9;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2260;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p></fn></p>
<p><disp-formula id="eqn-168"><label>(168)</label><mml:math id="mml-eqn-168" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
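<p>The step from Eq. (<xref ref-type="disp-formula" rid="eqn-164">164</xref>) to Eq. (<xref ref-type="disp-formula" rid="eqn-168">168</xref>) can be sketched as follows, under the assumption (suggested by the notation of <xref ref-type="disp-formula" rid="eqn-165">Eqs. (165)</xref>-<xref ref-type="disp-formula" rid="eqn-166">(166)</xref>, but not stated explicitly) that the <inline-formula><mml:math><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> minibatch examples are a subset of the <inline-formula><mml:math><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula> full-batch examples, so that each minibatch index coincides with exactly one full-batch index. The bilinearity in Eq. (<xref ref-type="disp-formula" rid="eqn-163">163</xref>) and the assumption in Eq. (<xref ref-type="disp-formula" rid="eqn-167">167</xref>) then give:</p>
<p><disp-formula><mml:math display="block"><mml:mtable columnalign="right left" rowspacing="3pt" columnspacing="0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msup><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:munderover><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:munderover><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mfrac><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msup><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:munderover><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mfrac><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Substituting these three results into Eq. (<xref ref-type="disp-formula" rid="eqn-164">164</xref>), the two cross terms and the full-batch term combine to <inline-formula><mml:math><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula>, which recovers Eq. (<xref ref-type="disp-formula" rid="eqn-168">168</xref>). As a consistency check, the full-batch case <inline-formula><mml:math><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula> gives a vanishing mean-square gradient error, as expected.</p>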
<p>The authors of [<xref ref-type="bibr" rid="ref-164">164</xref>] introduced the following stochastic differential equation as a continuous counterpart of the discrete parameter update Eq. (<xref ref-type="disp-formula" rid="eqn-156">156</xref>), as <inline-formula id="ieqn-1040"><mml:math id="mml-ieqn-1040"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-169"><label>(169)</label><mml:math id="mml-eqn-169" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1041"><mml:math id="mml-ieqn-1041"><mml:mrow><mml:mi>&#x1D52B;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the noise function, the continuous counterpart of the gradient error <inline-formula id="ieqn-1042"><mml:math id="mml-ieqn-1042"><mml:msub><mml:mrow><mml:mi>&#x1D522;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>:=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. The noise <inline-formula id="ieqn-1043"><mml:math id="mml-ieqn-1043"><mml:mrow><mml:mi>&#x1D52B;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is assumed to be Gaussian, i.e., with zero expectation (mean) and with covariance function of the form (see <xref ref-type="statement" rid="st6_9">Remark 6.9</xref> on Langevin stochastic differential equation):</p>
<p><disp-formula id="eqn-170"><label>(170)</label><mml:math id="mml-eqn-170" display="block"><mml:mrow><mml:mo>&#x2329;</mml:mo><mml:mi mathvariant='fraktur'>n</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x232A;</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;and&#x2009;&#x2009;</mml:mtext><mml:mo>&#x2329;</mml:mo><mml:mi mathvariant='fraktur'>n</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi mathvariant='fraktur'>n</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x232A;</mml:mo><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>&#x1D4D5;</mml:mi><mml:mi>&#x212D;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1044"><mml:math id="mml-ieqn-1044"><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula> is the expectation of a function, <inline-formula id="ieqn-1045"><mml:math id="mml-ieqn-1045"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow></mml:math></inline-formula> the &#x201C;noise scale&#x201D; or fluctuation factor, <inline-formula id="ieqn-1046"><mml:math id="mml-ieqn-1046"><mml:mrow><mml:mi mathvariant="fraktur">C</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the same gradient-error covariance matrix in Eq. (<xref ref-type="disp-formula" rid="eqn-167">167</xref>), and <inline-formula id="ieqn-1047"><mml:math id="mml-ieqn-1047"><mml:mrow><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula> the Dirac delta. Integrating Eq. (<xref ref-type="disp-formula" rid="eqn-169">169</xref>), we obtain:</p>
<p><disp-formula id="eqn-171"><label>(171)</label><mml:math id="mml-eqn-171" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:mrow><mml:mi mathvariant="fraktur">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" 
/><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mrow><mml:mo>&#x27E8;</mml:mo><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:mrow><mml:mi mathvariant="fraktur">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mo>&#x27E9;</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The fluctuation factor <inline-formula id="ieqn-1048"><mml:math id="mml-ieqn-1048"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow></mml:math></inline-formula> can be identified by equating the square of the error in Eq. (<xref ref-type="disp-formula" rid="eqn-156">156</xref>) to that in Eq. (<xref ref-type="disp-formula" rid="eqn-171">171</xref>), i.e.,</p>
<p><disp-formula id="eqn-172"><label>(172)</label><mml:math id="mml-eqn-172" display="block"><mml:mrow><mml:msup><mml:mi>&#x03F5;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>&#x2329;</mml:mo><mml:msup><mml:mi mathvariant='fraktur'>e</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant='fraktur'>e</mml:mi><mml:mo>&#x232A;</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:mrow><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mstyle displaystyle='true'><mml:mrow><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>&#x2329;</mml:mo><mml:mi mathvariant='fraktur'>n</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi mathvariant='fraktur'>n</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo 
stretchy='false'>)</mml:mo><mml:mo>&#x232A;</mml:mo><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mo>&#x21D2;</mml:mo><mml:msup><mml:mi>&#x03F5;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mtext>&#x1D5C6;</mml:mtext></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mtext>&#x1D5AC;</mml:mtext></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>&#x212D;</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mi>&#x1D4D5;</mml:mi><mml:mi>&#x212D;</mml:mi><mml:mo>&#x21D2;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>&#x1D4D5;</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mtext>&#x1D5C6;</mml:mtext></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mtext>&#x1D5AC;</mml:mtext></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mstyle></mml:mrow></mml:mrow></mml:mstyle></mml:mrow></mml:math></disp-formula></p>
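<p>The step from the double integral to the fluctuation factor in Eq. (<xref ref-type="disp-formula" rid="eqn-172">172</xref>) relies on the Dirac delta in Eq. (<xref ref-type="disp-formula" rid="eqn-170">170</xref>) collapsing the inner integral, so that the double integral reduces to the step length times the fluctuation factor times the covariance. The following Python sketch, which is not taken from [<xref ref-type="bibr" rid="ref-164">164</xref>], is a rough Monte Carlo check of Eq. (<xref ref-type="disp-formula" rid="eqn-168">168</xref>) under simplifying assumptions: the per-example gradients are scalars drawn independently from a zero-mean Gaussian with variance <monospace>C</monospace>, so that the covariance matrix reduces to a single number. All function names are hypothetical and chosen only for illustration.</p>
<code language="python" xml:space="preserve">
```python
import random


def empirical_error_variance(M, m, C, trials, seed=0):
    """Monte Carlo estimate of the mean squared gradient error.

    Assumption (illustrative only): each of the M per-example gradients is a
    scalar drawn independently from a zero-mean Gaussian with variance C.
    """
    rng = random.Random(seed)
    std = C ** 0.5
    total = 0.0
    for _ in range(trials):
        grads = [rng.gauss(0.0, std) for _ in range(M)]
        full_gradient = sum(grads) / M            # average over the training set
        # The examples are i.i.d., so the first m play the role of a random minibatch.
        minibatch_gradient = sum(grads[:m]) / m
        error = full_gradient - minibatch_gradient
        total += error * error
    return total / trials


def predicted_error_variance(M, m, C):
    """Right-hand side of Eq. (168) for a scalar covariance C."""
    return (1.0 / m - 1.0 / M) * C


def fluctuation_factor(eps, m, M):
    """Fluctuation factor of Eq. (172): eps * (1/m - 1/M)."""
    return eps * (1.0 / m - 1.0 / M)


if __name__ == "__main__":
    M, m, C = 50, 5, 2.0
    print("empirical:", empirical_error_variance(M, m, C, trials=4000))
    print("predicted:", predicted_error_variance(M, m, C))
    print("fluctuation factor for eps = 0.1:", fluctuation_factor(0.1, m, M))
```
</code>
<p>With these assumed values, the empirical estimate should agree with the prediction of Eq. (<xref ref-type="disp-formula" rid="eqn-168">168</xref>) up to sampling noise, which shrinks as the number of trials grows.</p>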
<statement id="st6_8"><title>Remark 6.8.</title>
<p><italic>Fluctuation factor for large training set</italic>. For large <inline-formula id="ieqn-1049"><mml:math id="mml-ieqn-1049"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula>, our fluctuation factor <inline-formula id="ieqn-1050"><mml:math id="mml-ieqn-1050"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow></mml:math></inline-formula> is roughly proportional to the ratio of the step length over the minibatch size, i.e., <inline-formula id="ieqn-1051"><mml:math id="mml-ieqn-1051"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow><mml:mo>&#x2248;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula>. Thus step-length <inline-formula id="ieqn-1052"><mml:math id="mml-ieqn-1052"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> decay, or equivalently minibatch size <inline-formula id="ieqn-1053"><mml:math id="mml-ieqn-1053"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> increase, corresponds to a decrease in the fluctuation factor <inline-formula id="ieqn-1054"><mml:math id="mml-ieqn-1054"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow></mml:math></inline-formula>. 
On the other hand, [<xref ref-type="bibr" rid="ref-164">164</xref>] obtained their fluctuation factor <inline-formula id="ieqn-1055"><mml:math id="mml-ieqn-1055"><mml:mrow><mml:mi>&#x1D4D6;</mml:mi></mml:mrow></mml:math></inline-formula> as<xref ref-type="fn" rid="fn165"><sup>165</sup></xref><fn id="fn165"><label>165</label><p>In [<xref ref-type="bibr" rid="ref-186">186</xref>] and [<xref ref-type="bibr" rid="ref-164">164</xref>], the fluctuation factor was expressed, in original notation, as <inline-formula id="ieqn-3221"><mml:math id="mml-ieqn-3221"><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where the equivalence with our notation is <inline-formula id="ieqn-3222"><mml:math id="mml-ieqn-3222"><mml:mi>g</mml:mi><mml:mo>&#x2261;</mml:mo><mml:mo>&#x1D4D6;</mml:mo></mml:math></inline-formula> (fluctuation factor), <inline-formula id="ieqn-3223"><mml:math id="mml-ieqn-3223"><mml:mi>N</mml:mi><mml:mo>&#x2261;</mml:mo><mml:mo>&#x1D5AC;</mml:mo></mml:math></inline-formula> (training-set size), <inline-formula id="ieqn-3224"><mml:math id="mml-ieqn-3224"><mml:mi>B</mml:mi><mml:mo>&#x2261;</mml:mo><mml:mrow><mml:mo>&#x1D5C6;</mml:mo></mml:mrow></mml:math></inline-formula> (minibatch size).</p></fn></p>
<p><disp-formula id="eqn-173"><label>(173)</label><mml:math id="mml-eqn-173" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x1D4D6;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>since their cost function was not an average, i.e., not divided by the minibatch size <inline-formula id="ieqn-1056"><mml:math id="mml-ieqn-1056"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula>, unlike our cost function in Eq. (<xref ref-type="disp-formula" rid="eqn-139">139</xref>). When <inline-formula id="ieqn-1057"><mml:math id="mml-ieqn-1057"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula>, our fluctuation factor <inline-formula id="ieqn-1058"><mml:math id="mml-ieqn-1058"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-172">172</xref>), but their fluctuation factor <inline-formula id="ieqn-1059"><mml:math id="mml-ieqn-1059"><mml:mrow><mml:mi>&#x1D4D6;</mml:mi></mml:mrow><mml:mo>&#x2248;</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula>, i.e., for increasingly large <inline-formula id="ieqn-1060"><mml:math id="mml-ieqn-1060"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula>, our fluctuation factor <inline-formula id="ieqn-1061"><mml:math id="mml-ieqn-1061"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow></mml:math></inline-formula> is bounded, but not their fluctuation factor <inline-formula id="ieqn-1062"><mml:math id="mml-ieqn-1062"><mml:mrow><mml:mi>&#x1D4D6;</mml:mi></mml:mrow></mml:math></inline-formula>. 
[<xref ref-type="bibr" rid="ref-186">186</xref>] then went on to show empirically that their fluctuation factor <inline-formula id="ieqn-1063"><mml:math id="mml-ieqn-1063"><mml:mrow><mml:mi>&#x1D4D6;</mml:mi></mml:mrow></mml:math></inline-formula> was proportional to the training-set size <inline-formula id="ieqn-1064"><mml:math id="mml-ieqn-1064"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula> for large <inline-formula id="ieqn-1065"><mml:math id="mml-ieqn-1065"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula>, as shown in Figure <xref ref-type="fig" rid="fig-64">64</xref>. On the other hand, for large training sets, our fluctuation factor <inline-formula id="ieqn-1066"><mml:math id="mml-ieqn-1066"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow></mml:math></inline-formula> does not depend on the training-set size <inline-formula id="ieqn-1067"><mml:math id="mml-ieqn-1067"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula>. As a result, unlike [<xref ref-type="bibr" rid="ref-186">186</xref>] in Figure <xref ref-type="fig" rid="fig-64">64</xref>, our optimal minibatch size would not depend on the training-set size <inline-formula id="ieqn-1068"><mml:math id="mml-ieqn-1068"><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:math></inline-formula>.</p></statement>
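<p>To make the contrast in Remark <xref ref-type="statement" rid="st6_8">6.8</xref> concrete, the following Python sketch evaluates both fluctuation factors, Eq. (<xref ref-type="disp-formula" rid="eqn-172">172</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-173">173</xref>), for increasing training-set size. The function names and the numerical values of the step length and minibatch size are assumptions made only for illustration.</p>
<code language="python" xml:space="preserve">
```python
def fluctuation_factor_averaged(eps, m, M):
    """Eq. (172): cost function averaged over the minibatch."""
    return eps * (1.0 / m - 1.0 / M)


def fluctuation_factor_summed(eps, m, M):
    """Eq. (173): cost function not divided by the minibatch size."""
    return eps * (M / m - 1.0)


if __name__ == "__main__":
    eps, m = 0.1, 128  # illustrative values, not prescribed by the text
    for M in (1_000, 10_000, 100_000, 1_000_000):
        F = fluctuation_factor_averaged(eps, m, M)
        G = fluctuation_factor_summed(eps, m, M)
        # F approaches eps / m, while G = M * F keeps growing with M.
        print(f"M = {M:9d}   F = {F:.6e}   G = {G:.6e}   eps/m = {eps / m:.6e}")
```
</code>
<p>In this sketch the averaged-cost factor stays below the ratio of step length to minibatch size for every training-set size, whereas the summed-cost factor grows roughly linearly with the training-set size.</p>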
<fig id="fig-64">
<label>Figure 64</label>
<caption><title><italic>Optimal minibatch size vs. training-set size</italic> (Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>). For a given training-set size, the smallest minibatch size that achieves the highest accuracy is optimal. Left figure: The optimal minibatch size moved to the right with increasing training-set size M. Right figure: The optimal minibatch size in [186] is linearly proportional to the training-set size M for large training sets (i.e., M &#x2192; &#x221E;), Eq. (<xref ref-type="disp-formula" rid="eqn-173">173</xref>), but our fluctuation factor &#x2131; is independent of M when M &#x2192; &#x221E;; see Remark <xref ref-type="statement" rid="st6_8">6.8</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-64.tif"/>
</fig>
<p>It was suggested in [<xref ref-type="bibr" rid="ref-164">164</xref>] to follow the same step-length decay schedules<xref ref-type="fn" rid="fn166"><sup>166</sup></xref><fn id="fn166"><label>166</label><p>See Figure <xref ref-type="fig" rid="fig-151">151</xref> in Section <xref ref-type="sec" rid="s14_7">14.7</xref> on &#x201C;Lack of transparency and irreproducibility of results&#x201D; in recent deep-learning papers.</p></fn> <inline-formula id="ieqn-1069"><mml:math id="mml-ieqn-1069"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>&#x2020;</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref> to adjust the size of the minibatches, while keeping the step length constant at its initial value <inline-formula id="ieqn-1070"><mml:math id="mml-ieqn-1070"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>. To demonstrate the equivalence between decreasing the step length and increasing minibatch size, the CIFAR-10 dataset with three different training schedules as shown in Figure <xref ref-type="fig" rid="fig-65">65</xref> was used in [<xref ref-type="bibr" rid="ref-164">164</xref>].</p>
<p>Figure <xref ref-type="fig" rid="fig-66">66</xref> shows the results: the number of parameter updates decreased drastically as the minibatch size increased, which significantly shortened the training wall-clock time.</p>
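<p>The equivalence between the two schedules can be sketched in a few lines of Python. The sketch below uses the values stated in the caption of Figure <xref ref-type="fig" rid="fig-65">65</xref> (initial step length 0.1, initial minibatch size 128, factor 5 at epochs 60, 120, 160) and the large-training-set approximation of the fluctuation factor as the ratio of step length to minibatch size. The function names are hypothetical, and the training-set size of 50,000 is the standard CIFAR-10 training split, used here as an assumption since the text does not state it.</p>
<code language="python" xml:space="preserve">
```python
import math


def schedule_factor(epoch, milestones=(60, 120, 160), factor=5):
    """Cumulative factor applied once each milestone epoch has been reached."""
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return factor ** passed


def decay_step_length(epoch, eps0=0.1, m0=128):
    """Schedule (1): decay the step length, keep the minibatch size."""
    return eps0 / schedule_factor(epoch), m0


def increase_minibatch(epoch, eps0=0.1, m0=128):
    """Schedule (3): keep the step length, increase the minibatch size."""
    return eps0, m0 * schedule_factor(epoch)


def noise_scale(eps, m):
    """Large-training-set approximation of the fluctuation factor, eps / m."""
    return eps / m


def updates_per_epoch(M, m):
    """Number of parameter updates needed to sweep M examples once."""
    return math.ceil(M / m)


if __name__ == "__main__":
    M = 50_000  # assumed CIFAR-10 training-set size
    for epoch in (0, 60, 120, 160):
        eps_a, m_a = decay_step_length(epoch)
        eps_b, m_b = increase_minibatch(epoch)
        print(epoch,
              noise_scale(eps_a, m_a), updates_per_epoch(M, m_a),
              noise_scale(eps_b, m_b), updates_per_epoch(M, m_b))
```
</code>
<p>Both schedules yield the same approximate fluctuation factor at every epoch, but the minibatch-increase schedule needs far fewer parameter updates per epoch once the milestones are passed, which is consistent with the reduction in the number of updates reported above.</p>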
<statement id="st6_9"><title>Remark 6.9.</title>
<p><italic>Langevin stochastic differential equation, annealing</italic>. Because the fluctuation factor <inline-formula id="ieqn-1071"><mml:math id="mml-ieqn-1071"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow></mml:math></inline-formula> is proportional to the step length, and in physics, fluctuation decreases with temperature (cooling), &#x201C;decaying the learning rate (step length) is simulated annealing&#x201D;<xref ref-type="fn" rid="fn168"><sup>168</sup></xref><fn id="fn168"><label>168</label><p>&#x201C;In metallurgy and materials science, annealing is a heat treatment that alters the physical and sometimes chemical properties of a material to increase its ductility and reduce its hardness, making it more workable. It involves heating a material above its recrystallization temperature, maintaining a suitable temperature for a suitable amount of time, and then allow slow cooling.&#x201D; Wikipedia, &#x2018;Annealing (metallurgy)&#x2019;, Version <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Annealing_(metallurgy)&amp;oldid=928035952">11:06, 26 November 2019</ext-link>. The name &#x201C;simulated annealing&#x201D; came from the highly cited paper [<xref ref-type="bibr" rid="ref-163">163</xref>], which received more than 20,000 citations on Web of Science and more than 40,000 citations on Google Scholar as of 2020.01.17. See also Remark <xref ref-type="statement" rid="st6_10">6.10</xref> on &#x201C;Metaheuristics&#x201D;.</p></fn> [<xref ref-type="bibr" rid="ref-164">164</xref>]. Here, we will connect the step length to &#x201C;temperature&#x201D; based on the analogy of Eq. (<xref ref-type="disp-formula" rid="eqn-169">169</xref>), the continuous counterpart of the parameter update Eq. (<xref ref-type="disp-formula" rid="eqn-156">156</xref>). In particular, we point to exact references that justify the assumptions in Eq. (<xref ref-type="disp-formula" rid="eqn-170">170</xref>).</p></statement>
<fig id="fig-65">
<label>Figure 65</label>
<caption><title><italic>Minibatch-size increase vs. step-length decay, training schedules</italic> (Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>). Left figure: Step length (learning rate) vs. number of epochs. Right figure: Minibatch size vs. number of epochs. Three learning-rate schedules<xref ref-type="fn" rid="fn167"><sup>167</sup></xref><fn id="fn167"><label>167</label><p>See Figure <xref ref-type="fig" rid="fig-151">151</xref> in Section <xref ref-type="sec" rid="s14_7">14.7</xref> on &#x201C;Lack of transparency and irreproducibility of results&#x201D; in recent deep-learning papers.</p></fn> were used for training: (1) The step length was decayed by a factor of 5, from an initial value of <inline-formula id="ieqn-699"><mml:math id="mml-ieqn-699"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, at specific epochs (60, 120, 160), while the minibatch size was kept constant (blue line); (2) Hybrid, i.e., the step length was initially kept constant until epoch 120, then decreased by a factor of 5 at epoch 120, and by another factor of 5 at epoch 160 (green line); (3) The step length was kept constant, while the minibatch size was increased by a factor of 5, from an initial value of 128, at the same specific epochs, 60, 120, 160 (red line). See Figure <xref ref-type="fig" rid="fig-66">66</xref> for the results using the CIFAR-10 dataset [<xref ref-type="bibr" rid="ref-164">164</xref>] (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-65.tif"/>
</fig>
<p>Even though the authors of [<xref ref-type="bibr" rid="ref-164">164</xref>] referred to [<xref ref-type="bibr" rid="ref-187">187</xref>] for Eq. (<xref ref-type="disp-formula" rid="eqn-169">169</xref>), the decomposition of the parameter update in [<xref ref-type="bibr" rid="ref-187">187</xref>]:</p>
<p><disp-formula id="eqn-174"><label>(174)</label><mml:math id="mml-eqn-174" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msqrt><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">v</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">v</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>:=</mml:mo><mml:msqrt><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msqrt><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">e</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with the intriguing factor <inline-formula id="ieqn-1072"><mml:math id="mml-ieqn-1072"><mml:msqrt><mml:mrow><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msqrt></mml:math></inline-formula> was consistent with the equivalent expression in [<xref ref-type="bibr" rid="ref-185">185</xref>], p. 53, Eq. (3.5.10),<xref ref-type="fn" rid="fn169"><sup>169</sup></xref><fn id="fn169"><label>169</label><p>In original notation used in [<xref ref-type="bibr" rid="ref-185">185</xref>], p. 53, Eq. (3.5.10) reads as <inline-formula id="ieqn-3225"><mml:math id="mml-ieqn-3225"><mml:mrow><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msqrt><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:msqrt></mml:math></inline-formula>, in which the noise <inline-formula id="ieqn-3226"><mml:math id="mml-ieqn-3226"><mml:mrow><mml:mi mathvariant='bold-italic'>&#x03B7;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> has zero mean, i.e., <inline-formula id="ieqn-3227"><mml:math 
id="mml-ieqn-3227"><mml:mo>&#x2329;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, and covariance matrix <inline-formula id="ieqn-3228"><mml:math id="mml-ieqn-3228"><mml:mo>&#x2329;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">B</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p></fn> which was obtained from the Fokker-Planck equation:</p>
<p><disp-formula id="eqn-175"><label>(175)</label><mml:math id="mml-eqn-175" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msqrt><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:msqrt><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1073"><mml:math id="mml-ieqn-1073"><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a nonlinear operator. The noise term <inline-formula id="ieqn-1074"><mml:math id="mml-ieqn-1074"><mml:msqrt><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msqrt><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is <italic>not</italic> related to the gradient error as in Eq. (<xref ref-type="disp-formula" rid="eqn-174">174</xref>), and is Gaussian with zero mean and covariance matrix of the form:</p>
<p><disp-formula id="eqn-176"><label>(176)</label><mml:math id="mml-eqn-176" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msqrt><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:msqrt><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="bold-italic">B</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The column matrix (or vector) <inline-formula id="ieqn-1075"><mml:math id="mml-ieqn-1075"><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-175">175</xref>) is called the <italic>drift vector</italic>, and the square matrix <inline-formula id="ieqn-1076"><mml:math id="mml-ieqn-1076"><mml:mrow><mml:mi mathvariant="bold-italic">B</mml:mi></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-176">176</xref>) the <italic>diffusion matrix</italic>, [<xref ref-type="bibr" rid="ref-185">185</xref>], p. 52. Eq. (<xref ref-type="disp-formula" rid="eqn-175">175</xref>) implies that <inline-formula id="ieqn-1077"><mml:math id="mml-ieqn-1077"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a continuous function, called the &#x201C;sample path&#x201D;.</p>
<p>To obtain a differential equation, Eq. (<xref ref-type="disp-formula" rid="eqn-175">175</xref>) can be rewritten as</p>
<p><disp-formula id="eqn-177"><label>(177)</label><mml:math id="mml-eqn-177" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msqrt><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:msqrt></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which shows that the derivative of <inline-formula id="ieqn-1078"><mml:math id="mml-ieqn-1078"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> does not exist when taking the limit as <inline-formula id="ieqn-1079"><mml:math id="mml-ieqn-1079"><mml:mtext>&#x0394;</mml:mtext><mml:mi>t</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, not only due to the factor <inline-formula id="ieqn-1080"><mml:math id="mml-ieqn-1080"><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msqrt><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:msqrt><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula>, but also due to the noise <inline-formula id="ieqn-1081"><mml:math id="mml-ieqn-1081"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, [<xref ref-type="bibr" rid="ref-185">185</xref>], p. 53.</p>
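<p>As an illustrative aside, the discrete update in Eq. (<xref ref-type="disp-formula" rid="eqn-175">175</xref>) can be sketched in Python for a scalar parameter. This is a minimal sketch, not taken from [<xref ref-type="bibr" rid="ref-185">185</xref>]: the function names, the zero drift, and the unit noise variance are assumptions made only for this example. With zero drift, the difference quotient in Eq. (<xref ref-type="disp-formula" rid="eqn-177">177</xref>) reduces to the noise divided by the square root of the time step, so its sample variance should grow in inverse proportion to the time step.</p>
<preformat><![CDATA[
```python
import math
import random


def sde_step(theta, drift, dt, noise_std, rng):
    """One step of Eq. (175) for a scalar state: theta + A*dt + sqrt(dt)*eta.

    Assumption: eta is zero-mean Gaussian with variance B = noise_std**2.
    """
    eta = rng.gauss(0.0, noise_std)
    return theta + drift(theta) * dt + math.sqrt(dt) * eta


def difference_quotient_variance(dt, noise_std=1.0, n_samples=5000, seed=0):
    """Sample variance of (theta(t+dt) - theta(t)) / dt with zero drift (assumed)."""
    rng = random.Random(seed)
    quotients = [
        (sde_step(0.0, lambda th: 0.0, dt, noise_std, rng) - 0.0) / dt
        for _ in range(n_samples)
    ]
    mean = sum(quotients) / n_samples
    return sum((q - mean) ** 2 for q in quotients) / (n_samples - 1)
```
]]></preformat>
<p>Calling <monospace>difference_quotient_variance</monospace> with a time step one hundred times smaller should return a variance about one hundred times larger, consistent with the unbounded difference quotient in Eq. (<xref ref-type="disp-formula" rid="eqn-177">177</xref>).</p>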
<p>The last term <inline-formula id="ieqn-1082"><mml:math id="mml-ieqn-1082"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msqrt><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:msqrt></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-177">177</xref>) corresponds to the random force <inline-formula id="ieqn-1083"><mml:math id="mml-ieqn-1083"><mml:mi>X</mml:mi></mml:math></inline-formula> exerted on a pollen particle by the viscous fluid molecules in the 1-D equation of motion of the pollen particle, as derived by Langevin and in his original notation, [<xref ref-type="bibr" rid="ref-188">188</xref>]:<xref ref-type="fn" rid="fn170"><sup>170</sup></xref><fn id="fn170"><label>170</label><p>See also &#x201C;Langevin equation&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Langevin_equation&amp;oldid=929100467">version 17:40, 3 December 2019</ext-link>.</p></fn></p>
<fig id="fig-66">
<label>Figure 66</label>
<caption><title><italic>Minibatch-size increase, fewer parameter updates, faster computation</italic> (Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>). For each of the three training schedules in Figure <xref ref-type="fig" rid="fig-65">65</xref>, the same learning curve is plotted against the number of epochs (left figure) and against the number of parameter updates (right figure). The right figure shows the significant decrease in the number of parameter updates, and thus in computational cost, for the training schedule with minibatch-size increase. The blue curve ends at about 80,000 parameter updates for step-length decrease, whereas the red curve ends at about 29,000 parameter updates for minibatch-size increase [<xref ref-type="bibr" rid="ref-164">164</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-66.tif"/>
</fig>
<p><disp-formula id="eqn-178"><label>(178)</label><mml:math id="mml-eqn-178" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>m</mml:mi><mml:mfrac><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:msup><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>6</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mi>&#x03BC;</mml:mi><mml:mi>a</mml:mi><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mi>m</mml:mi><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">f</mml:mi></mml:mrow><mml:mi>v</mml:mi><mml:mo>+</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi>v</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="fraktur">f</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mn>6</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mi>&#x03BC;</mml:mi><mml:mi>a</mml:mi><mml:mspace width="thinmathspace" 
/><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1084"><mml:math id="mml-ieqn-1084"><mml:mi>m</mml:mi></mml:math></inline-formula> is the mass of the pollen particle, <inline-formula id="ieqn-1085"><mml:math id="mml-ieqn-1085"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> its displacement, <inline-formula id="ieqn-1086"><mml:math id="mml-ieqn-1086"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula> the fluid viscosity, <inline-formula id="ieqn-1087"><mml:math id="mml-ieqn-1087"><mml:mi>a</mml:mi></mml:math></inline-formula> the particle radius, <inline-formula id="ieqn-1088"><mml:math id="mml-ieqn-1088"><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the particle velocity, and <inline-formula id="ieqn-1089"><mml:math id="mml-ieqn-1089"><mml:mrow><mml:mi>&#x1D523;</mml:mi></mml:mrow></mml:math></inline-formula> the friction coefficient between the particle and the fluid. The random (noise) force <inline-formula id="ieqn-1090"><mml:math id="mml-ieqn-1090"><mml:mi>X</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> by the fluid molecules impacting the pollen particle is assumed to (1) be independent of the position <inline-formula id="ieqn-1091"><mml:math id="mml-ieqn-1091"><mml:mi>x</mml:mi></mml:math></inline-formula>, (2) vary extremely rapidly compared to the change of <inline-formula id="ieqn-1092"><mml:math id="mml-ieqn-1092"><mml:mi>x</mml:mi></mml:math></inline-formula>, (3) have zero mean as in Eq. (<xref ref-type="disp-formula" rid="eqn-176">176</xref>). 
The covariance of this noise force <inline-formula id="ieqn-1093"><mml:math id="mml-ieqn-1093"><mml:mi>X</mml:mi></mml:math></inline-formula> is proportional to the absolute temperature <inline-formula id="ieqn-1094"><mml:math id="mml-ieqn-1094"><mml:mi>T</mml:mi></mml:math></inline-formula>, and takes the form, [<xref ref-type="bibr" rid="ref-189">189</xref>], p. 12,<xref ref-type="fn" rid="fn171"><sup>171</sup></xref><fn id="fn171"><label>171</label><p>For first-time learners, here is a guide for further reading on a derivation of Eq. (<xref ref-type="disp-formula" rid="eqn-179">179</xref>). It is better to follow the book [<xref ref-type="bibr" rid="ref-189">189</xref>], rather than Coffey&#x2019;s 1985 long review paper, cited for equation <inline-formula id="ieqn-3229"><mml:math id="mml-ieqn-3229"><mml:mrow><mml:mover accent='true'><mml:mrow><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mi mathvariant='fraktur'>f</mml:mi><mml:mi>k</mml:mi><mml:mi>T</mml:mi><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula> on p. 12, not exactly the same left-hand side as in Eq. (<xref ref-type="disp-formula" rid="eqn-179">179</xref>), and for which the derivation appeared some 50 pages later in the book. The factor <inline-formula id="ieqn-3230"><mml:math id="mml-ieqn-3230"><mml:mn>2</mml:mn><mml:mrow><mml:mi mathvariant="fraktur">f</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mi>T</mml:mi></mml:math></inline-formula> was called the <italic>spectral density</italic>. 
The time average <inline-formula id="ieqn-3231"><mml:math id="mml-ieqn-3231"><mml:mrow><mml:mover accent='true'><mml:mrow><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> was defined, and the name <italic>autocorrelation function</italic> introduced on p. 13. But a particular case of the more general autocorrelation function is <inline-formula id="ieqn-3232"><mml:math id="mml-ieqn-3232"><mml:mrow><mml:mo>&#x2329;</mml:mo> <mml:mrow><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>&#x232A;</mml:mo></mml:mrow></mml:math></inline-formula>, which was defined on p. 59, and by the ergodic theorem, <inline-formula id="ieqn-3233"><mml:math id="mml-ieqn-3233"><mml:mrow><mml:mover accent='true'><mml:mrow><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x2329;</mml:mo> <mml:mrow><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>&#x232A;</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> for stationary processes, p. 
60, where the spectral density of <inline-formula id="ieqn-3234"><mml:math id="mml-ieqn-3234"><mml:mi>X</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, denoted by <inline-formula id="ieqn-3235"><mml:math id="mml-ieqn-3235"><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03C9;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> was defined as the Fourier transform of the autocorrelation function <inline-formula id="ieqn-3236"><mml:math id="mml-ieqn-3236"><mml:mrow><mml:mrow><mml:mo>&#x2329;</mml:mo> <mml:mrow><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>&#x232A;</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>. The derivation of Eq. (<xref ref-type="disp-formula" rid="eqn-179">179</xref>) started from the beginning of Section 1.7, on p. 60, with the result obtained on p. 62, where the confusion of using the same notation <inline-formula id="ieqn-3237"><mml:math id="mml-ieqn-3237"><mml:mi>D</mml:mi></mml:math></inline-formula> in <inline-formula id="ieqn-3238"><mml:math id="mml-ieqn-3238"><mml:mn>2</mml:mn><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mrow><mml:mi mathvariant="fraktur">f</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mi>T</mml:mi></mml:math></inline-formula>, the spectral density, and in <inline-formula id="ieqn-3239"><mml:math id="mml-ieqn-3239"><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mi>k</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="fraktur">f</mml:mi></mml:mrow></mml:math></inline-formula>, the diffusion coefficient [[<xref ref-type="bibr" rid="ref-189">189</xref>], p. 20, [<xref ref-type="bibr" rid="ref-185">185</xref>], p. 
7] was noted.</p></fn></p>
<p><disp-formula id="eqn-179"><label>(179)</label><mml:math id="mml-eqn-179" display="block"><mml:mrow><mml:mo>&#x2329;</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x232A;</mml:mo><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mi mathvariant='fraktur'>f</mml:mi><mml:mi>k</mml:mi><mml:mi>T</mml:mi><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1095"><mml:math id="mml-ieqn-1095"><mml:mi>k</mml:mi></mml:math></inline-formula> denotes the Boltzmann constant.</p>
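<p>A minimal numerical sketch may help to connect Eq. (<xref ref-type="disp-formula" rid="eqn-178">178</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-179">179</xref>). It rests on assumptions not taken from [<xref ref-type="bibr" rid="ref-188">188</xref>] or [<xref ref-type="bibr" rid="ref-189">189</xref>]: the velocity equation is integrated with a forward Euler-Maruyama step, and over one time step the delta-correlated force is represented by a Gaussian impulse with zero mean and variance equal to twice the friction coefficient times <italic>kT</italic> times the time step. The function names are hypothetical. If the sketch is sound, the stationary mean square velocity should approach <italic>kT</italic>/<italic>m</italic> (equipartition in one dimension), up to a small discretization bias.</p>
<preformat><![CDATA[
```python
import math
import random


def simulate_langevin_velocities(m, f, kT, dt, n_steps, n_particles, seed=0):
    """Euler-Maruyama integration of m dv/dt = -f v + X(t), cf. Eq. (178).

    Assumption: over a step dt the delta-correlated force of Eq. (179)
    delivers an impulse with zero mean and variance 2*f*kT*dt.
    """
    rng = random.Random(seed)
    impulse_std = math.sqrt(2.0 * f * kT * dt)
    velocities = [0.0] * n_particles  # all particles start at rest
    for _ in range(n_steps):
        velocities = [
            v + (-f * v * dt + rng.gauss(0.0, impulse_std)) / m
            for v in velocities
        ]
    return velocities


def mean_square(values):
    """Ensemble average of the squared values."""
    return sum(v * v for v in values) / len(values)
```
]]></preformat>
<p>For example, with the hypothetical nondimensional values <monospace>m = f = kT = 1</monospace>, <monospace>dt = 0.02</monospace>, 300 steps, and 2000 particles, <monospace>mean_square</monospace> of the returned velocities should be close to one, with a sampling scatter of a few percent.</p>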
<p>The covariance of the noise <inline-formula id="ieqn-1096"><mml:math id="mml-ieqn-1096"><mml:mrow><mml:mi>&#x1D52B;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-170">170</xref>) is similar to the covariance of the noise <inline-formula id="ieqn-1097"><mml:math id="mml-ieqn-1097"><mml:mi>X</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-179">179</xref>), and thus the fluctuation factor <inline-formula id="ieqn-1098"><mml:math id="mml-ieqn-1098"><mml:mrow><mml:mi>&#x1D4D5;</mml:mi></mml:mrow></mml:math></inline-formula>, and hence the step length <inline-formula id="ieqn-1099"><mml:math id="mml-ieqn-1099"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-172">172</xref>), can be interpreted as being proportional to temperature <inline-formula id="ieqn-1100"><mml:math id="mml-ieqn-1100"><mml:mi>T</mml:mi></mml:math></inline-formula>. Therefore, decaying the step length <inline-formula id="ieqn-1101"><mml:math id="mml-ieqn-1101"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula>, or increasing the minibatch size <inline-formula id="ieqn-1102"><mml:math id="mml-ieqn-1102"><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:math></inline-formula>, is equivalent to cooling down the temperature <inline-formula id="ieqn-1103"><mml:math id="mml-ieqn-1103"><mml:mi>T</mml:mi></mml:math></inline-formula>, and simulating the physical annealing, and hence the name <italic>simulated annealing</italic> (see <xref ref-type="statement" rid="st6_10">Remark 6.10</xref>).</p>
<p>Eq. (<xref ref-type="disp-formula" rid="eqn-178">178</xref>) cannot be directly integrated to obtain the velocity <inline-formula id="ieqn-1104"><mml:math id="mml-ieqn-1104"><mml:mi>v</mml:mi></mml:math></inline-formula> in terms of the noise force <inline-formula id="ieqn-1105"><mml:math id="mml-ieqn-1105"><mml:mi>X</mml:mi></mml:math></inline-formula> since the derivative does not exist, as interpreted in Eq. (<xref ref-type="disp-formula" rid="eqn-177">177</xref>). Langevin circumvented this problem by multiplying Eq. (<xref ref-type="disp-formula" rid="eqn-178">178</xref>) by the displacement <inline-formula id="ieqn-1106"><mml:math id="mml-ieqn-1106"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and taking the average to obtain, [<xref ref-type="bibr" rid="ref-188">188</xref>]:</p>
<p><disp-formula id="eqn-180"><label>(180)</label><mml:math id="mml-eqn-180" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mi>m</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mi>&#x03BC;</mml:mi><mml:mi>a</mml:mi><mml:mi>z</mml:mi><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi>R</mml:mi><mml:mi>T</mml:mi></mml:mrow><mml:mi>N</mml:mi></mml:mfrac><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1107"><mml:math id="mml-ieqn-1107"><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mi>d</mml:mi><mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula> is the time derivative of the mean square displacement, <inline-formula id="ieqn-1108"><mml:math id="mml-ieqn-1108"><mml:mi>R</mml:mi></mml:math></inline-formula> the ideal gas constant, and <inline-formula id="ieqn-1109"><mml:math id="mml-ieqn-1109"><mml:mi>N</mml:mi></mml:math></inline-formula> the Avogadro number. Eq. (<xref ref-type="disp-formula" rid="eqn-180">180</xref>) can be integrated to yield an expression for <inline-formula id="ieqn-1110"><mml:math id="mml-ieqn-1110"><mml:mi>z</mml:mi></mml:math></inline-formula>, which led to Einstein&#x2019;s result for Brownian motion.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p> 
<statement id="st6_10"><title><xref ref-type="statement" rid="st6_10">Remark 6.10</xref>.</title>
<p><italic>Metaheuristics and nature-inspired optimization algorithms</italic>. There is a large class of nature-inspired optimization algorithms that implement general conceptual <italic>metaheuristics</italic> (such as neighborhood search, multi-start, hill climbing, accepting negative moves, etc.) and that include many well-known methods such as Evolutionary Algorithms (EAs), Artificial Bee Colony (ABC), Firefly Algorithm, etc. [<xref ref-type="bibr" rid="ref-190">190</xref>].</p>
<p>The most famous of these nature-inspired algorithms is perhaps simulated annealing in [<xref ref-type="bibr" rid="ref-163">163</xref>], which is described in [<xref ref-type="bibr" rid="ref-191">191</xref>], p. 18, as being &#x201C;inspired by the annealing process of metals. It is a trajectory-based search algorithm starting with an initial guess solution at a high temperature and gradually cooling down the system. A move or new solution is accepted if it is better; otherwise, it is accepted with a probability, which makes it possible for the system to escape any local optima&#x201D;, i.e., the metaheuristic &#x201C;accepting negative moves&#x201D; mentioned in [<xref ref-type="bibr" rid="ref-190">190</xref>]. &#x201C;It is then expected that if the system is cooled down slowly enough, the global optimal solution can be reached&#x201D;, [<xref ref-type="bibr" rid="ref-191">191</xref>], p. 18; in the present context, slow cooling corresponds to step-length decay or minibatch-size increase, as mentioned above. See also <xref ref-type="fn" rid="fn140">Footnotes 140</xref> and <xref ref-type="fn" rid="fn168">168</xref>.</p>
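<p>The description quoted above can be sketched in a few lines of Python. This is a generic textbook-style sketch, not the algorithm of [<xref ref-type="bibr" rid="ref-163">163</xref>] or [<xref ref-type="bibr" rid="ref-191">191</xref>]: the Metropolis acceptance probability, the geometric cooling schedule, the uniform neighborhood move, all default parameter values, and the test cost function <monospace>bumpy</monospace> are assumptions chosen for illustration.</p>
<preformat><![CDATA[
```python
import math
import random


def simulated_annealing(cost, x0, t_start=1.0, cooling=0.995, n_iter=2000,
                        step=0.5, seed=0):
    """Minimize cost(x) for scalar x (assumed Metropolis rule, geometric cooling)."""
    rng = random.Random(seed)
    x, fx = x0, cost(x0)
    best_x, best_f = x, fx
    temperature = t_start
    for _ in range(n_iter):
        candidate = x + rng.uniform(-step, step)  # neighborhood move
        fc = cost(candidate)
        # Better moves are always accepted; worse ("negative") moves are
        # accepted with probability exp(-(fc - fx) / temperature).
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / temperature):
            x, fx = candidate, fc
            if fx < best_f:
                best_x, best_f = x, fx
        temperature *= cooling  # gradual cooling
    return best_x, best_f


def bumpy(x):
    """Hypothetical test cost with several local minima."""
    return x * x + 3.0 * math.sin(5.0 * x)
```
]]></preformat>
<p>A call such as <monospace>simulated_annealing(bumpy, 3.0)</monospace> returns the best point visited and its cost. At high temperature most worse moves are accepted, whereas at low temperature the search behaves almost like pure descent, which is the role played by step-length decay or minibatch-size increase in the analogy above.</p>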
<p>For applications of these nature-inspired algorithms, we cite the following works, without detailed review: [<xref ref-type="bibr" rid="ref-191">191</xref>] [<xref ref-type="bibr" rid="ref-192">192</xref>] [<xref ref-type="bibr" rid="ref-193">193</xref>] [<xref ref-type="bibr" rid="ref-194">194</xref>] [<xref ref-type="bibr" rid="ref-195">195</xref>] [<xref ref-type="bibr" rid="ref-196">196</xref>] [<xref ref-type="bibr" rid="ref-197">197</xref>] [<xref ref-type="bibr" rid="ref-198">198</xref>] [<xref ref-type="bibr" rid="ref-199">199</xref>].&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement></sec>
<sec id="s6_3_6"><label>6.3.6</label>
<title>Weight decay, avoiding overfit</title>
<p>Reducing, or decaying, the network parameters <inline-formula id="ieqn-1111"><mml:math id="mml-ieqn-1111"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> (which include the weights and the biases) is one method to avoid overfitting by adding a parameter-decay term to the update equation:</p>
<p><disp-formula id="eqn-181"><label>(181)</label><mml:math id="mml-eqn-181" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1112"><mml:math id="mml-ieqn-1112"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the decay parameter, hence the name &#x201C;weight decay&#x201D;. This update is equivalent to SGD with <inline-formula id="ieqn-1113"><mml:math id="mml-ieqn-1113"><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> regularization, i.e., to adding an extra penalty term to the cost function; see Eq. (<xref ref-type="disp-formula" rid="eqn-248">248</xref>) in Section <xref ref-type="sec" rid="s6_5_10">6.5.10</xref> on the adaptive learning-rate method <xref ref-type="sec" rid="s6_5_10">AdamW</xref>, where this equivalence is explained following [<xref ref-type="bibr" rid="ref-56">56</xref>]. &#x201C;Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 117. Weight decay is only one of several forms of regularization, which also include large learning rates, small batch sizes, and dropout [<xref ref-type="bibr" rid="ref-200">200</xref>]. The effect of the weight-decay parameter <inline-formula id="ieqn-1114"><mml:math id="mml-ieqn-1114"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow></mml:math></inline-formula> in avoiding overfitting of the network model is shown in Figure <xref ref-type="fig" rid="fig-67">67</xref>.</p>
<p>It was written in [<xref ref-type="bibr" rid="ref-201">201</xref>] that: &#x201C;In the neural network community the two most common methods to avoid overfitting are early stopping and weight decay [<xref ref-type="bibr" rid="ref-175">175</xref>]. Early stopping has the advantage of being quick, since it shortens the training time, but the disadvantage of being poorly defined and not making full use of the available data. Weight decay, on the other hand, has the advantage of being well defined, but the disadvantage of being quite time consuming&#x201D; (because of tuning). For examples of tuning the weight decay parameter <inline-formula id="ieqn-1115"><mml:math id="mml-ieqn-1115"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow></mml:math></inline-formula>, which is of the order of <inline-formula id="ieqn-1116"><mml:math id="mml-ieqn-1116"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, see [<xref ref-type="bibr" rid="ref-201">201</xref>] [<xref ref-type="bibr" rid="ref-56">56</xref>].</p>
<p>In the case of weight decay with cyclic annealing, both the step length <inline-formula id="ieqn-1117"><mml:math id="mml-ieqn-1117"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> and the weight decay parameter <inline-formula id="ieqn-1118"><mml:math id="mml-ieqn-1118"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow></mml:math></inline-formula> are scaled by the annealing multiplier <inline-formula id="ieqn-1119"><mml:math id="mml-ieqn-1119"><mml:msub><mml:mrow><mml:mi>&#x1D51E;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in the parameter update [<xref ref-type="bibr" rid="ref-56">56</xref>]:</p>
<p><disp-formula id="eqn-182"><label>(182)</label><mml:math id="mml-eqn-182" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">a</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">a</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
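<p>As a minimal sketch of Eq. (<xref ref-type="disp-formula" rid="eqn-182">182</xref>), the update can be written as a short Python function. The function name, the argument names, and the numerical values are illustrative assumptions and are not taken from [<xref ref-type="bibr" rid="ref-56">56</xref>]; the annealing multiplier <italic>a<sub>k</sub></italic> is assumed to be supplied by the annealing schedule.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
def annealed_weight_decay_step(theta, grad, eps_k, decay, a_k):
    """One parameter update following Eq. (182).

    theta : list of floats, current parameters (theta_k)
    grad  : list of floats, gradient estimate (g_k), same length as theta
    eps_k : step length (learning rate) at iteration k
    decay : weight-decay parameter (the fraktur d in the text)
    a_k   : annealing multiplier at iteration k (assumed given by the schedule)
    """
    if len(theta) != len(grad):
        raise ValueError("theta and grad must have the same length")
    # theta_{k+1} = theta_k - a_k * (eps_k * g_k + d * theta_k)
    return [t - a_k * (eps_k * g + decay * t) for t, g in zip(theta, grad)]


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the formula.
    theta_next = annealed_weight_decay_step(
        theta=[1.0, -2.0], grad=[0.5, 0.5], eps_k=0.1, decay=0.01, a_k=0.5
    )
    print(theta_next)
```
]]></code>
<p>Since the multiplier scales both terms inside the parentheses, setting <italic>a<sub>k</sub></italic> to zero in this sketch leaves the parameters unchanged, whereas setting it to one recovers the update with weight decay but without annealing.</p>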
<p>The effectiveness of SGD with weight decay, with and without cyclic annealing, is presented in Figure <xref ref-type="fig" rid="fig-74">74</xref>.</p>
<fig id="fig-67">
<label>Figure 67</label>
<caption><title><italic>Weight decay</italic> (Section <xref ref-type="sec" rid="s6_3_6">6.3.6</xref>). Effects of magnitude of weight-decay parameter <inline-formula id="ieqn-700"><mml:math id="mml-ieqn-700"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow></mml:math></inline-formula>. Adapted from [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 116. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-67.tif"/>
</fig>
</sec>
<sec id="s6_3_7"><label>6.3.7</label>
<title>Combining all add-on tricks</title>
<p>To obtain a general parameter-update equation that combines all of the above add-on improvement tricks, start with the parameter update with momentum and accelerated gradient, Eq. (<xref ref-type="disp-formula" rid="eqn-141">141</xref>),</p>
<p><disp-formula id="eqn-141a"><mml:math id="mml-eqn-141a" display="block"><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi>g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo 
stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>and add the weight-decay term <inline-formula id="ieqn-1120"><mml:math id="mml-ieqn-1120"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> from Eq. (<xref ref-type="disp-formula" rid="eqn-181">181</xref>), then scale both the weight-decay term and the gradient-descent term <inline-formula id="ieqn-1121"><mml:math id="mml-ieqn-1121"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> by the cyclic annealing multiplier <inline-formula id="ieqn-1122"><mml:math id="mml-ieqn-1122"><mml:msub><mml:mrow><mml:mi>&#x1D51E;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-182">182</xref>), leaving the momentum term <inline-formula id="ieqn-1123"><mml:math id="mml-ieqn-1123"><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> alone, to obtain:</p>
<p><disp-formula id="eqn-183"><label>(183)</label><mml:math id="mml-eqn-183" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">a</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi 
mathvariant="fraktur">d</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is included in Algorithm <xref ref-type="fig" rid="fig-162">4</xref>.</p></sec></sec>
<sec id="s6_4"><label>6.4</label>
<title>Kaiming He initialization</title>
<p>All optimization algorithms discussed above share a step that crucially affects the convergence of training, especially as neural networks become &#x2018;deep&#x2019;: the <italic>initialization</italic> of the network&#x2019;s parameters <inline-formula id="ieqn-1124"><mml:math id="mml-ieqn-1124"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>, which not only affects the speed of convergence but also &#x201C;can determine whether the algorithm converges at all.&#x201D;<xref ref-type="fn" rid="fn172"><sup>172</sup></xref><fn id="fn172"><label>172</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 292.</p></fn> Convergence problems are typically related to the scaling of the initial parameters. Large parameters result in large activations, which lead to exploding values during forward and backward propagation, i.e., during the evaluation of the loss function and the computation of its gradient with respect to the parameters. Small parameters, on the other hand, may result in <italic>vanishing gradients</italic>, i.e., the loss function becomes insensitive to the parameters, which causes the training process to stall. 
To be precise, considerations regarding initialization are related to the weight matrices <inline-formula id="ieqn-1125"><mml:math id="mml-ieqn-1125"><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, see Section <xref ref-type="sec" rid="s4_4">4.4</xref> on &#x201C;Network layer, detailed construct&#x201D;; bias vectors <inline-formula id="ieqn-1126"><mml:math id="mml-ieqn-1126"><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are usually initialized to zero, which is also assumed in the subsequent considerations.<xref ref-type="fn" rid="fn173"><sup>173</sup></xref><fn id="fn173"><label>173</label><p>Situations favoring a nonzero initialization of biases are explained in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 297.</p></fn></p>
<p>The <italic>Kaiming He initialization</italic> provides a means, as effective as it is simple, to overcome the scaling issues observed when weights are randomly initialized using a normal distribution with fixed standard deviation. The key idea of the authors [<xref ref-type="bibr" rid="ref-127">127</xref>] is to have the same variance of weights for each of the network&#x2019;s layers. As opposed to the <italic>Xavier initialization</italic><xref ref-type="fn" rid="fn174"><sup>174</sup></xref><fn id="fn174"><label>174</label><p>See [<xref ref-type="bibr" rid="ref-202">202</xref>].</p></fn>, the nonlinearity of activation functions is accounted for. Consider the <inline-formula id="ieqn-1127"><mml:math id="mml-ieqn-1127"><mml:mi>&#x2113;</mml:mi></mml:math></inline-formula>-th layer of a feedforward neural network, where the vector <inline-formula id="ieqn-1128"><mml:math id="mml-ieqn-1128"><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is an affine function of the inputs <inline-formula id="ieqn-1129"><mml:math id="mml-ieqn-1129"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-184"><label>(184)</label><mml:math id="mml-eqn-184" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1130"><mml:math id="mml-ieqn-1130"><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1131"><mml:math id="mml-ieqn-1131"><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> denote the layer&#x2019;s weight matrix and bias vector, respectively. The output of the layer <inline-formula id="ieqn-1132"><mml:math id="mml-ieqn-1132"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is given by element-wise application of the activation function <inline-formula id="ieqn-1133"><mml:math id="mml-ieqn-1133"><mml:mi>a</mml:mi></mml:math></inline-formula>, see Sections <xref ref-type="sec" rid="s4_4_1">4.4.1</xref> and <xref ref-type="sec" rid="s4_4_2">4.4.2</xref> for a detailed presentation. All components of the weight matrix <inline-formula id="ieqn-1134"><mml:math id="mml-ieqn-1134"><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are assumed to be independent of each other and to share the same probability distribution. 
The same holds for the components of the input vector <inline-formula id="ieqn-1135"><mml:math id="mml-ieqn-1135"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the output vector <inline-formula id="ieqn-1136"><mml:math id="mml-ieqn-1136"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. Additionally, the elements of <inline-formula id="ieqn-1137"><mml:math id="mml-ieqn-1137"><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1138"><mml:math id="mml-ieqn-1138"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are assumed to be mutually independent. Further, it is assumed that <inline-formula id="ieqn-1139"><mml:math id="mml-ieqn-1139"><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1140"><mml:math id="mml-ieqn-1140"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> have zero mean (i.e., expectation, cf. Eq. (<xref ref-type="disp-formula" rid="eqn-67">67</xref>)) and are symmetric around zero. 
In this case, the variance of <inline-formula id="ieqn-1141"><mml:math id="mml-ieqn-1141"><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> is given by</p>
<p><disp-formula id="eqn-185"><label>(185)</label><mml:math id="mml-eqn-185" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula><disp-formula id="eqn-186"><label>(186)</label><mml:math id="mml-eqn-186" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-187"><label>(187)</label><mml:math id="mml-eqn-187" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1142"><mml:math id="mml-ieqn-1142"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> denotes the width of the <inline-formula id="ieqn-1143"><mml:math id="mml-ieqn-1143"><mml:mi>&#x2113;</mml:mi></mml:math></inline-formula>-th layer and the fundamental relation <inline-formula id="ieqn-1144"><mml:math id="mml-ieqn-1144"><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mi>Y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>Y</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>Y</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>Y</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>X</mml:mi></mml:math></inline-formula> has been used along with the assumption of weights having a zero mean, i.e., <inline-formula id="ieqn-1145"><mml:math id="mml-ieqn-1145"><mml:mi>&#x1D53C;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>. The variance of some random variable <inline-formula id="ieqn-1146"><mml:math id="mml-ieqn-1146"><mml:mi>X</mml:mi></mml:math></inline-formula>, which is the expectation of the squared deviation of <inline-formula id="ieqn-1147"><mml:math id="mml-ieqn-1147"><mml:mi>X</mml:mi></mml:math></inline-formula> from its mean, i.e.,</p>
<p><disp-formula id="eqn-188"><label>(188)</label><mml:math id="mml-eqn-188" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>is a measure of the &#x201C;dispersion&#x201D; of <inline-formula id="ieqn-1148"><mml:math id="mml-ieqn-1148"><mml:mi>X</mml:mi></mml:math></inline-formula> around its mean value. The variance is the square of the standard deviation <inline-formula id="ieqn-1149"><mml:math id="mml-ieqn-1149"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> of the random variable <inline-formula id="ieqn-1150"><mml:math id="mml-ieqn-1150"><mml:mi>X</mml:mi></mml:math></inline-formula>, or, conversely,</p>
<p><disp-formula id="eqn-189"><label>(189)</label><mml:math id="mml-eqn-189" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>X</mml:mi></mml:msqrt><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:msqrt><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>As opposed to the variance, the standard deviation of some random variable <inline-formula id="ieqn-1151"><mml:math id="mml-ieqn-1151"><mml:mi>X</mml:mi></mml:math></inline-formula> has the same physical dimension as <inline-formula id="ieqn-1152"><mml:math id="mml-ieqn-1152"><mml:mi>X</mml:mi></mml:math></inline-formula> itself.</p>
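<p>The product relation for the variance of two independent random variables, used above to pass from Eq. (<xref ref-type="disp-formula" rid="eqn-185">185</xref>) to Eq. (<xref ref-type="disp-formula" rid="eqn-186">186</xref>), can be checked exactly on small discrete distributions. The following Python sketch does so with rational arithmetic; the two distributions and all helper names are hypothetical choices made for illustration only.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
from fractions import Fraction
from itertools import product


def expectation(dist, f=lambda v: v):
    """E(f(X)) for a discrete distribution given as {value: probability}."""
    return sum(p * f(v) for v, p in dist.items())


def variance(dist):
    """Var X = E((X - E(X))^2), cf. Eq. (188)."""
    mean = expectation(dist)
    return expectation(dist, lambda v: (v - mean) ** 2)


def product_distribution(dist_x, dist_y):
    """Distribution of the product X*Y for independent X and Y."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        out[x * y] = out.get(x * y, Fraction(0)) + px * py
    return out


# Hypothetical example distributions (not taken from the text).
X = {Fraction(-1): Fraction(1, 4), Fraction(2): Fraction(3, 4)}
Y = {Fraction(0): Fraction(1, 2), Fraction(3): Fraction(1, 2)}

# Left-hand side: variance of the product, computed from its distribution.
lhs = variance(product_distribution(X, Y))
# Right-hand side: Var X Var Y + E(X)^2 Var Y + E(Y)^2 Var X.
rhs = (variance(X) * variance(Y)
       + expectation(X) ** 2 * variance(Y)
       + expectation(Y) ** 2 * variance(X))
print(lhs, rhs, lhs == rhs)
```
]]></code>
<p>Rational arithmetic is used in this sketch so that both sides can be compared for exact equality rather than up to round-off.</p>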
<p>The elementary relation <inline-formula id="ieqn-1153"><mml:math id="mml-ieqn-1153"><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>X</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, applied to the inputs in Eq. (<xref ref-type="disp-formula" rid="eqn-187">187</xref>), gives</p>
<p><disp-formula id="eqn-190"><label>(190)</label><mml:math id="mml-eqn-190" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
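<p>Eq. (<xref ref-type="disp-formula" rid="eqn-190">190</xref>) can be illustrated by a small Monte Carlo experiment. The sketch below is not part of the original derivation and rests on assumptions made only for this example: normally distributed weights with zero mean, inputs obtained by applying the ReLU function to standard normal samples (so that their mean does not vanish), zero bias, and hypothetical values for the number of summed terms <monospace>m</monospace>, the weight standard deviation <monospace>sigma_w</monospace>, and the sample size.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import random


def relu(x):
    """ReLU activation, max(0, x)."""
    return max(0.0, x)


def sample_preactivation(m, sigma_w, rng):
    """Draw one z_i = sum_j W_ij * y_j with zero bias.

    Weights W_ij ~ N(0, sigma_w^2); inputs y_j = ReLU(g_j) with g_j ~ N(0, 1),
    so the inputs have a nonzero mean.
    """
    return sum(rng.gauss(0.0, sigma_w) * relu(rng.gauss(0.0, 1.0))
               for _ in range(m))


def estimate_variance(samples):
    """Population-style variance estimate, cf. Eq. (188)."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


rng = random.Random(0)                 # fixed seed for reproducibility
m, sigma_w, n_samples = 8, 0.5, 20000  # hypothetical example values

# Empirical variance of the pre-activation z_i.
z = [sample_preactivation(m, sigma_w, rng) for _ in range(n_samples)]
var_z_empirical = estimate_variance(z)

# Prediction of Eq. (190): m * Var(W_ij) * E(y_j^2), with E(y_j^2) estimated
# from an independent set of input samples.
y_sq = [relu(rng.gauss(0.0, 1.0)) ** 2 for _ in range(n_samples)]
var_z_predicted = m * sigma_w ** 2 * (sum(y_sq) / n_samples)

print(var_z_empirical, var_z_predicted)
```
]]></code>
<p>Under these assumptions, the two printed values should agree up to sampling error, even though the inputs do not have zero mean; only the zero mean of the weights is needed for Eq. (<xref ref-type="disp-formula" rid="eqn-190">190</xref>).</p>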
<p>Note that the mean of the inputs does not vanish for activation functions that are not symmetric about zero, e.g., the ReLU functions (see Section <xref ref-type="sec" rid="s5_3_2">5.3.2</xref>). For the ReLU activation function, <inline-formula id="ieqn-1154"><mml:math id="mml-ieqn-1154"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and an input <italic>x</italic> with zero mean and a probability density <italic>P</italic>(<italic>x</italic>) symmetric about zero, the mean value of the squared output and the variance of the input are related by</p>
<p><disp-formula id="eqn-191"><label>(191)</label><mml:math id="mml-eqn-191" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msubsup><mml:msup><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mi>y</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-192"><label>(192)</label><mml:math id="mml-eqn-192" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msubsup><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msubsup><mml:msup><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msubsup><mml:msup><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mi>x</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-193"><label>(193)</label><mml:math id="mml-eqn-193" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>x</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Substituting the above result in Eq. (<xref ref-type="disp-formula" rid="eqn-190">190</xref>) provides the following relationship between the variances of the inputs to the activation function of two consecutive layers, i.e., <inline-formula id="ieqn-1155"><mml:math id="mml-ieqn-1155"><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1156"><mml:math id="mml-ieqn-1156"><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, respectively:</p>
<p><disp-formula id="eqn-194"><label>(194)</label><mml:math id="mml-eqn-194" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mn>2</mml:mn></mml:mfrac><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>For a network with <inline-formula id="ieqn-1157"><mml:math id="mml-ieqn-1157"><mml:mi>L</mml:mi></mml:math></inline-formula> layers, the following relation between the variance of inputs <inline-formula id="ieqn-1158"><mml:math id="mml-ieqn-1158"><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and outputs <inline-formula id="ieqn-1159"><mml:math id="mml-ieqn-1159"><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is obtained:</p>
<p><disp-formula id="eqn-195"><label>(195)</label><mml:math id="mml-eqn-195" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mi>L</mml:mi></mml:munderover><mml:mfrac><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mn>2</mml:mn></mml:mfrac><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>To preserve the variance through all layers of the network, the following condition must be fulfilled regarding the variance of weight matrices:</p>
<p><disp-formula id="eqn-196"><label>(196)</label><mml:math id="mml-eqn-196" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mn>2</mml:mn></mml:mfrac><mml:mrow><mml:mrow><mml:mtext>Var</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mspace width="1em" /><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:mi>&#x2113;</mml:mi><mml:mspace width="2em" /><mml:mo stretchy="false">&#x2194;</mml:mo><mml:mspace width="2em" /><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mfrac><mml:mn>2</mml:mn><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1160"><mml:math id="mml-ieqn-1160"><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the normal (or Gaussian) distribution with zero mean and variance <inline-formula id="ieqn-1161"><mml:math id="mml-ieqn-1161"><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula>. The above result, which is known as <italic>Kaiming He initialization</italic>, implies that the width <inline-formula id="ieqn-1162"><mml:math id="mml-ieqn-1162"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> of a layer must be taken into account in the initialization of the weight matrices. Preserving the variance of the inputs mitigates exploding or vanishing gradients and improves convergence, in particular for deep networks. The authors of [<xref ref-type="bibr" rid="ref-127">127</xref>] provided analogous results for the parametric rectified linear unit (PReLU, see Section <xref ref-type="sec" rid="s5_3_3">5.3.3</xref>).</p>
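<p>The variance-preservation argument leading to Eq. (<xref ref-type="disp-formula" rid="eqn-196">196</xref>) can be checked numerically. The following minimal NumPy sketch is an illustration under stated assumptions, not an implementation from the cited works: all layers share the same width <italic>m</italic> (so that the numbers of inputs and outputs of a layer coincide), biases are zero, the network input has unit variance, and the function names <monospace>relu</monospace> and <monospace>preactivation_variances</monospace> are introduced here only for this example.</p>
<p><code language="python"><![CDATA[
```python
import numpy as np


def relu(x):
    """ReLU activation, y = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)


def preactivation_variances(width, n_layers, weight_var, n_samples=1000, seed=0):
    """Empirical Var z^(l), l = 1..n_layers, for zero-bias ReLU layers.

    Weights are drawn i.i.d. from N(0, weight_var). All layers share the
    same width (an assumption of this sketch, not of the derivation).
    """
    rng = np.random.default_rng(seed)
    y = rng.standard_normal((n_samples, width))  # unit-variance network input
    variances = []
    for _ in range(n_layers):
        W = rng.normal(0.0, np.sqrt(weight_var), size=(width, width))
        z = y @ W                    # inputs to the activation function
        variances.append(float(z.var()))
        y = relu(z)                  # outputs passed on to the next layer
    return variances


if __name__ == "__main__":
    m = 256
    # Eq. (193): E(y^2) = Var(x) / 2 for a zero-mean, symmetric input x.
    x = np.random.default_rng(1).normal(0.0, 3.0, size=200_000)
    print("E(y^2) / Var(x):", np.mean(relu(x) ** 2) / np.var(x))
    # Eq. (196): Var W = 2/m keeps Var z of the same order in every layer,
    # whereas Var W = 1/m roughly halves it per layer, as Eq. (194) predicts.
    print("Var W = 2/m:", preactivation_variances(m, 10, 2.0 / m))
    print("Var W = 1/m:", preactivation_variances(m, 10, 1.0 / m))
```
]]></code></p>
<p>With Var <italic>W</italic> = 2/<italic>m</italic>, the sampled variances should remain of the same order through all layers, whereas with Var <italic>W</italic> = 1/<italic>m</italic> they should shrink by roughly a factor of two per layer, in line with Eq. (<xref ref-type="disp-formula" rid="eqn-194">194</xref>); the exact numbers depend on the random seed and on the finite width.</p>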
</sec>
<sec id="s6_5"><label>6.5</label>
<title>Adaptive methods: Adam, variants, criticism</title>
<p>The Adam algorithm was introduced in [<xref ref-type="bibr" rid="ref-170">170</xref>] (version 1), updated in 2017 (version 9), and has been &#x201C;immensely successful in development of several state-of-the-art solutions for a wide range of problems,&#x201D; as stated in [<xref ref-type="bibr" rid="ref-182">182</xref>]. &#x201C;In the area of neural networks, the ADAM-Optimizer is one of the most popular adaptive step size methods. It was invented in [<xref ref-type="bibr" rid="ref-170">170</xref>]. The 5865 citations in only three years shows additionally the importance of the given paper&#x201D;<xref ref-type="fn" rid="fn175"><sup>175</sup></xref><fn id="fn175"><label>175</label><p>Reference [<xref ref-type="bibr" rid="ref-170">170</xref>], which introduced the Adam algorithm in 2014, had received 34,535 citations as of 2019.12.11, after 5 years, and a whopping 112,797 citations as of 2022.07.11, more than 2.5 years later, according to Google Scholar.</p></fn> [<xref ref-type="bibr" rid="ref-203">203</xref>]. The authors of [<xref ref-type="bibr" rid="ref-204">204</xref>] concurred: &#x201C;Adam is widely used in both academia and industry. However, it is also one of the least well-understood algorithms. In recent years, some remarkable works provided us with better understanding of the algorithm, and proposed different variants of it.&#x201D;</p>
<sec id="s6_5_1"><label>6.5.1</label>
<title>Unified adaptive learning-rate pseudocode</title>
<p>A unified pseudocode, adapted in Algorithm <xref ref-type="fig" rid="fig-163">5</xref>, was suggested in [<xref ref-type="bibr" rid="ref-182">182</xref>]; it includes not only the standard SGD in Algorithm <xref ref-type="fig" rid="fig-162">4</xref>, but also a number of successful adaptive learning-rate methods: <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref>, the recent <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref>, and <xref ref-type="sec" rid="s6_5_10">AdamW</xref>. Our adaptation in Algorithm <xref ref-type="fig" rid="fig-163">5</xref> also includes Nostalgic Adam and AdamX.<xref ref-type="fn" rid="fn176"><sup>176</sup></xref><fn id="fn176"><label>176</label><p>In lines 12-13 of Algorithm 5, use Eq. (<xref ref-type="disp-formula" rid="eqn-183">183</xref>), but replace the scalar learning rate <inline-formula id="ieqn-3240"><mml:math id="mml-ieqn-3240"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> with the matrix learning rate <inline-formula id="ieqn-3241"><mml:math id="mml-ieqn-3241"><mml:msub><mml:mrow><mml:mi mathvariant='bold-italic'>&#x03F5;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-200">200</xref>) to update the parameters; the result includes Eq. (<xref ref-type="disp-formula" rid="eqn-201">201</xref>) for vanilla adaptive methods and Eq. (<xref ref-type="disp-formula" rid="eqn-251">251</xref>) for <xref ref-type="sec" rid="s6_5_10">AdamW</xref>.</p></fn></p>
<p>Four new quantities are introduced for iteration <inline-formula id="ieqn-1163"><mml:math id="mml-ieqn-1163"><mml:mi>k</mml:mi></mml:math></inline-formula> in SGD: (1) the first moment <inline-formula id="ieqn-1164"><mml:math id="mml-ieqn-1164"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">m</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> at SGD iteration <inline-formula id="ieqn-1165"><mml:math id="mml-ieqn-1165"><mml:mi>k</mml:mi></mml:math></inline-formula>, and (2) its correction <inline-formula id="ieqn-1166"><mml:math id="mml-ieqn-1166"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-197"><label>(197)</label><mml:math id="mml-eqn-197" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and (3) the second moment (variance)<xref ref-type="fn" rid="fn177"><sup>177</sup></xref><fn id="fn177"><label>177</label><p>The uppercase letter <inline-formula id="ieqn-3242"><mml:math id="mml-ieqn-3242"><mml:mrow><mml:mi mathvariant='bold-italic'>V</mml:mi></mml:mrow></mml:math></inline-formula> is used instead of the lowercase letter <inline-formula id="ieqn-3243"><mml:math id="mml-ieqn-3243"><mml:mrow><mml:mi mathvariant='bold-italic'>v</mml:mi></mml:mrow></mml:math></inline-formula>, which is usually reserved for the &#x201C;velocity&#x201D; used in a term called &#x201C;momentum&#x201D;, which is added to the gradient term to correct the descent direction. Such an algorithm is called gradient descent with momentum in the deep-learning optimization literature; see, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], Section <xref ref-type="sec" rid="s8_3_2">8.3.2</xref> Momentum, p. 288.</p></fn> <inline-formula id="ieqn-1167"><mml:math id="mml-ieqn-1167"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> and (4) its correction <inline-formula id="ieqn-1168"><mml:math id="mml-ieqn-1168"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-198"><label>(198)</label><mml:math id="mml-eqn-198" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The descent direction estimate <inline-formula id="ieqn-1169"><mml:math id="mml-ieqn-1169"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> at SGD iteration <inline-formula id="ieqn-1170"><mml:math id="mml-ieqn-1170"><mml:mi>k</mml:mi></mml:math></inline-formula> for each training epoch is</p>
<p><disp-formula id="eqn-199"><label>(199)</label><mml:math id="mml-eqn-199" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The adaptive learning rate <inline-formula id="ieqn-1171"><mml:math id="mml-ieqn-1171"><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is obtained by rescaling the fixed learning-rate schedule <inline-formula id="ieqn-1172"><mml:math id="mml-ieqn-1172"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, also called the &#x201C;global&#x201D; learning rate, particularly when it is a constant, using the second moment <inline-formula id="ieqn-1173"><mml:math id="mml-ieqn-1173"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> as follows:</p>
<p><disp-formula id="eqn-200"><label>(200)</label><mml:math id="mml-eqn-200" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msqrt><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:msqrt></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;or&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msqrt><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:msqrt><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:mtext>&#x00A0;(element-wise operations)</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1179"><mml:math id="mml-ieqn-1179"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> can be either Eq. (<xref ref-type="disp-formula" rid="eqn-147">147</xref>)<xref ref-type="fn" rid="fn178"><sup>178</sup></xref><fn id="fn178"><label>178</label><p>Suggested in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 287.</p></fn> (which includes <inline-formula id="ieqn-1180"><mml:math id="mml-ieqn-1180"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo></mml:math></inline-formula> constant) or Eq. (<xref ref-type="disp-formula" rid="eqn-149">149</xref>);<xref ref-type="fn" rid="fn179"><sup>179</sup></xref><fn id="fn179"><label>179</label><p>Suggested in [<xref ref-type="bibr" rid="ref-182">182</xref>], p. 3.</p></fn> <inline-formula id="ieqn-1181"><mml:math id="mml-ieqn-1181"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> is a small number to avoid division by zero;<xref ref-type="fn" rid="fn180"><sup>180</sup></xref><fn id="fn180"><label>180</label><p>AdaDelta and RMSProp used the first form of Eq. (<xref ref-type="disp-formula" rid="eqn-200">200</xref>), with &#x03B4; inside the square root, whereas AdaGrad and Adam used the second form, with &#x03B4; outside the square root.
AMSGrad, Nostalgic Adam, and AdamX did not use &#x03B4;, i.e., set <inline-formula id="ieqn-3247"><mml:math id="mml-ieqn-3247"><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>.</p></fn> and the operations (square root, addition, division) are element-wise, with both <inline-formula id="ieqn-1182"><mml:math id="mml-ieqn-1182"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1183"><mml:math id="mml-ieqn-1183"><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>6</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#x00A0;to&#x00A0;</mml:mtext></mml:mrow><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>8</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (depending on the algorithm) being constants.</p>
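<p>The two forms of Eq. (<xref ref-type="disp-formula" rid="eqn-200">200</xref>) differ only in where &#x03B4; enters, which matters when an entry of the second moment is close to zero. The following is a minimal NumPy sketch; the function name <monospace>adaptive_rate</monospace> and the flag <monospace>delta_inside_sqrt</monospace> are hypothetical names introduced here for illustration only.</p>
<p><code language="python"><![CDATA[
```python
import numpy as np


def adaptive_rate(eps_global, V_hat, delta=1e-8, delta_inside_sqrt=True):
    """Element-wise adaptive learning rate in the two forms of Eq. (200).

    eps_global is the scalar global learning rate eps(k); V_hat is the
    corrected second-moment array, with the same shape as the parameters.
    """
    V_hat = np.asarray(V_hat, dtype=float)
    if delta_inside_sqrt:
        return eps_global / np.sqrt(V_hat + delta)    # first form of Eq. (200)
    return eps_global / (np.sqrt(V_hat) + delta)      # second form of Eq. (200)


if __name__ == "__main__":
    V_hat = np.array([0.0, 1e-4, 1.0])
    print(adaptive_rate(0.01, V_hat, delta_inside_sqrt=True))
    print(adaptive_rate(0.01, V_hat, delta_inside_sqrt=False))
```
]]></code></p>
<p>For a vanishing entry of the second moment, the first form bounds the rate by &#x03F5;(<italic>k</italic>)/&#x221A;&#x03B4; and the second form by &#x03F5;(<italic>k</italic>)/&#x03B4;, so the same value of &#x03B4; yields very different upper bounds in the two forms; away from zero the two forms nearly coincide.</p>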
<fig id="fig-163">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-163.tif"/>
</fig>
<statement id="st6_11"><title><xref ref-type="statement" rid="st6_11">Remark 6.11</xref>.</title>
<p>A particular case is the AdaDelta algorithm, in which <inline-formula id="ieqn-1184"><mml:math id="mml-ieqn-1184"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">m</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-197">197</xref>) is the second moment for the network parameter increments <inline-formula id="ieqn-1185"><mml:math id="mml-ieqn-1185"><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-1186"><mml:math id="mml-ieqn-1186"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, also in Eq. (<xref ref-type="disp-formula" rid="eqn-197">197</xref>), is the corrected gradient.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>All of the above arrays, such as <inline-formula id="ieqn-1187"><mml:math id="mml-ieqn-1187"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">m</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1188"><mml:math id="mml-ieqn-1188"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-197">197</xref>), <inline-formula id="ieqn-1189"><mml:math id="mml-ieqn-1189"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1190"><mml:math id="mml-ieqn-1190"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-198">198</xref>), and <inline-formula id="ieqn-1191"><mml:math id="mml-ieqn-1191"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-199">199</xref>), together with the resulting array <inline-formula id="ieqn-1192"><mml:math id="mml-ieqn-1192"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-200">200</xref>), have the same structure as the network parameter array <inline-formula id="ieqn-1193"><mml:math id="mml-ieqn-1193"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> in Eq.
(<xref ref-type="disp-formula" rid="eqn-31">31</xref>), with <inline-formula id="ieqn-1194"><mml:math id="mml-ieqn-1194"><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-34">34</xref>) being the total number of parameters. The update of the network parameter estimate in <inline-formula id="ieqn-1195"><mml:math id="mml-ieqn-1195"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> is written as follows:</p>
<p><disp-formula id="eqn-201"><label>(201)</label><mml:math id="mml-eqn-201" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mtext>&#x00A0;(element-wise operations)</mml:mtext></mml:mrow><mml:mspace 
width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the Hadamard operator <inline-formula id="ieqn-1196"><mml:math id="mml-ieqn-1196"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> (element-wise multiplication) is omitted to lighten the notation.<xref ref-type="fn" rid="fn181"><sup>181</sup></xref><fn id="fn181"><label>181</label><p>There are no symbols similar to the Hadamard operator symbol &#x02299; for other operations such as square root, addition, and division, as implied in Eq. (<xref ref-type="disp-formula" rid="eqn-200">200</xref>), so there is no need to use the symbol &#x02299; just for multiplication.</p></fn></p>
<statement id="st6_12"><title><xref ref-type="statement" rid="st6_12">Remark 6.12</xref>.</title>
<p>The element-wise operations in Eq. (<xref ref-type="disp-formula" rid="eqn-200">200</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-201">201</xref>) would allow each parameter in array <inline-formula id="ieqn-1197"><mml:math id="mml-ieqn-1197"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> to have its own learning rate, unlike in traditional deterministic optimization algorithms, such as in Algorithm <xref ref-type="fig" rid="fig-160">2</xref> or even in the Standard SGD Algorithm <xref ref-type="fig" rid="fig-162">4</xref>, where the same learning rate is applied to all parameters in <inline-formula id="ieqn-1198"><mml:math id="mml-ieqn-1198"><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>It remains to define the functions <inline-formula id="ieqn-1199"><mml:math id="mml-ieqn-1199"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1200"><mml:math id="mml-ieqn-1200"><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-197">197</xref>), and <inline-formula id="ieqn-1201"><mml:math id="mml-ieqn-1201"><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1202"><mml:math id="mml-ieqn-1202"><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-198">198</xref>) for each of the particular algorithms covered by the unified Algorithm <xref ref-type="fig" rid="fig-163">5</xref>.</p>
<p><bold>SGD.</bold> To obtain Algorithm <xref ref-type="fig" rid="fig-162">4</xref> as a particular case, select the following functions for Algorithm <xref ref-type="fig" rid="fig-163">5</xref>:</p>
<p><disp-formula id="eqn-202"><label>(202)</label><mml:math id="mml-eqn-202" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mrow><mml:mtext>&#x00A0;(Identity)</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-203"><label>(203)</label><mml:math id="mml-eqn-203" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>together with the learning-rate schedule <inline-formula id="ieqn-1203"><mml:math id="mml-ieqn-1203"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> presented in Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref> on step-length decay and annealing. In other words, from <xref ref-type="disp-formula" rid="eqn-199">Eqs. (199)</xref>-<xref ref-type="disp-formula" rid="eqn-201">(201)</xref>, the parameter update reduces to that of the vanilla SGD with the fixed learning-rate schedule <inline-formula id="ieqn-1204"><mml:math id="mml-ieqn-1204"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, without scaling:</p>
<p><disp-formula id="eqn-204"><label>(204)</label><mml:math id="mml-eqn-204" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>SGD with momentum and accelerated gradient (Section <xref ref-type="sec" rid="s6_3_2">6.3.2</xref>), step-length decay and cyclic annealing (Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref>), and weight decay (Section <xref ref-type="sec" rid="s6_3_6">6.3.6</xref>) can be recovered similarly as particular cases.</p>
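<p>As a minimal sketch of how the SGD selection in Eqs. (<xref ref-type="disp-formula" rid="eqn-202">202</xref>)-(<xref ref-type="disp-formula" rid="eqn-203">203</xref>) collapses the element-wise update of Eq. (<xref ref-type="disp-formula" rid="eqn-201">201</xref>) into Eq. (<xref ref-type="disp-formula" rid="eqn-204">204</xref>), the following Python fragment may help. It assumes plain lists in place of parameter arrays and takes the descent direction to be the negative gradient; the function names <monospace>unified_step</monospace> and <monospace>sgd_step</monospace> and the numerical values are illustrative only and do not come from the algorithms referenced above.</p>
<code language="python" xml:space="preserve">
```python
from typing import List


def unified_step(theta: List[float], grad: List[float],
                 eps_k: List[float]) -> List[float]:
    """Element-wise update theta_{k+1} = theta_k + eps_k * d_k.

    Assumption for this sketch: d_k is the negative gradient, i.e.,
    the first-moment estimate is the gradient itself and no scaling
    by a second-moment estimate is applied.
    """
    d = [-g for g in grad]
    return [t + e * di for t, e, di in zip(theta, eps_k, d)]


def sgd_step(theta: List[float], grad: List[float], eps: float) -> List[float]:
    """Vanilla SGD: one scalar learning rate eps(k) shared by all parameters."""
    eps_k = [eps] * len(theta)  # eps_k = eps(k) * I, no per-parameter scaling
    return unified_step(theta, grad, eps_k)


# Hypothetical two-parameter example.
theta_next = sgd_step([1.0, -2.0], [0.5, -4.0], eps=0.1)
print(theta_next)
```
</code>
<p>Passing a list <monospace>eps_k</monospace> with different entries to <monospace>unified_step</monospace> would give each parameter its own learning rate, which is the situation described in Remark 6.12.</p>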
<fig id="fig-68">
<label>Figure 68</label>
<caption><title><italic>Convergence of adaptive learning-rate algorithms</italic> (Section <xref ref-type="sec" rid="s6_3_2">6.3.2</xref>): <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_3_2">SGDNesterov</xref>, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref> [<xref ref-type="bibr" rid="ref-170">170</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-68.tif"/>
</fig>
</sec>
<sec id="s6_5_2"><label>6.5.2</label>
<title>AdaGrad: Adaptive Gradient</title>
<p>Starting the line of research on adaptive learning-rate algorithms, the authors of [<xref ref-type="bibr" rid="ref-52">52</xref>]<xref ref-type="fn" rid="fn182"><sup>182</sup></xref><fn id="fn182"><label>182</label><p>As of 2019.11.28, [<xref ref-type="bibr" rid="ref-52">52</xref>] was cited 5,385 times on Google Scholar, and 1,615 times on Web of Science. By 2022.07.11, [<xref ref-type="bibr" rid="ref-52">52</xref>] was cited 10,431 times on Google Scholar, and 3,871 times on Web of Science.</p></fn> selected the following functions for Algorithm <xref ref-type="fig" rid="fig-163">5</xref>:</p>
<p><disp-formula id="eqn-205"><label>(205)</label><mml:math id="mml-eqn-205" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mrow><mml:mtext>&#x00A0;(Identity)</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-206"><label>(206)</label><mml:math id="mml-eqn-206" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mtext>&#x00A0;(element-wise square)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-207"><label>(207)</label><mml:math id="mml-eqn-207" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msqrt><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:msqrt><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:mtext>&#x00A0;(element-wise 
operations)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>leading to an update with adaptive scaling of the learning rate</p>
<p><disp-formula id="eqn-208"><label>(208)</label><mml:math id="mml-eqn-208" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msqrt><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:msqrt><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;(element-wise operations)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>in which each parameter in <inline-formula id="ieqn-1205"><mml:math id="mml-ieqn-1205"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is updated with its own learning rate. For a given network parameter, say, <inline-formula id="ieqn-1206"><mml:math id="mml-ieqn-1206"><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>q</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, its learning rate <inline-formula id="ieqn-1207"><mml:math id="mml-ieqn-1207"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>p</mml:mi><mml:mi>q</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is essentially <inline-formula id="ieqn-1208"><mml:math id="mml-ieqn-1208"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> scaled with the inverse of the square root of the sum of the squares of all historical values of the corresponding gradient component <inline-formula id="ieqn-1209"><mml:math id="mml-ieqn-1209"><mml:mo stretchy="false">(</mml:mo><mml:mi>p</mml:mi><mml:mi>q</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-1210"><mml:math id="mml-ieqn-1210"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mo>+</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>p</mml:mi><mml:mi>q</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:msup><mml:mo 
stretchy="false">)</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-1211"><mml:math id="mml-ieqn-1211"><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> being very small. A consequence of such scaling is that a larger gradient component would have a smaller learning rate and a smaller per-iteration decrease in the learning rate, whereas a smaller gradient component would have a larger learning rate and a higher per-iteration decrease in the learning rate, even though the relative decrease is about the same.<xref ref-type="fn" rid="fn183"><sup>183</sup></xref><fn id="fn183"><label>183</label><p>See [<xref ref-type="bibr" rid="ref-54">54</xref>]. For example, compare the sequence <inline-formula id="ieqn-3250"></inline-formula> to the sequence <inline-formula id="ieqn-3251"></inline-formula>. The authors of [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 299, mistakenly stated &#x201C;The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.&#x201D;</p></fn> Thus progress along different directions with large difference in gradient amplitudes is evened out as the number of iterations increases.<xref ref-type="fn" rid="fn184"><sup>184</sup></xref><fn id="fn184"><label>184</label><p>It was stated in [<xref ref-type="bibr" rid="ref-54">54</xref>]: &#x201C;progress along each dimension evens out over time. 
This is very beneficial for training deep neural networks since the scale of the gradients in each layer is often different by several orders of magnitude, so the optimal Learning rate should take that into account.&#x201D; Such an observation makes more sense than saying &#x201C;The net effect is greater progress in the more gently sloped directions of parameter space&#x201D; as did the authors of [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 299, who referred to AdaDelta in Section 8.5.4, p. 302, through the work of other authors, but might not have read [<xref ref-type="bibr" rid="ref-54">54</xref>].</p></fn></p>
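<p>The scaling behavior described above can be checked numerically with a small sketch of Eqs. (<xref ref-type="disp-formula" rid="eqn-206">206</xref>)-(<xref ref-type="disp-formula" rid="eqn-208">208</xref>). The fragment below assumes plain Python lists in place of parameter arrays, a constant schedule <inline-formula><mml:math><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula>, and constant gradient components of very different magnitude; the name <monospace>adagrad_step</monospace> and all numerical values are hypothetical choices made for illustration.</p>
<code language="python" xml:space="preserve">
```python
import math
from typing import List, Tuple


def adagrad_step(theta: List[float], grad: List[float], v_hat: List[float],
                 eps: float, delta: float = 1e-7
                 ) -> Tuple[List[float], List[float], List[float]]:
    """One AdaGrad update with element-wise operations.

    v_hat accumulates the squared gradient components, and each
    parameter gets its own rate eps / (sqrt(v_hat) + delta).
    """
    v_new = [v + g * g for v, g in zip(v_hat, grad)]
    rates = [eps / (math.sqrt(v) + delta) for v in v_new]
    theta_new = [t - r * g for t, r, g in zip(theta, rates, grad)]
    return theta_new, v_new, rates


# Two parameters whose (constant) gradient components differ by a factor of 100.
theta, v_hat = [0.0, 0.0], [0.0, 0.0]
grad = [10.0, 0.1]
history = []  # per-parameter learning rates at each iteration
for k in range(3):
    theta, v_hat, rates = adagrad_step(theta, grad, v_hat, eps=0.01)
    history.append(rates)

for k, rates in enumerate(history, start=1):
    print(k, rates)
print(theta)
```
</code>
<p>In this toy run, the component with the larger gradient receives the smaller learning rate (about 0.001 against about 0.1 at the first iteration) and the smaller absolute decrease per iteration, while the ratio of successive learning rates is nearly the same for both components. Both parameters end up moving by nearly the same amount, which illustrates how progress along different directions is evened out.</p>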
<p>Figure <xref ref-type="fig" rid="fig-68">68</xref> shows the convergence of some adaptive learning-rate algorithms: <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_3_2">SGDNesterov</xref>, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref>.</p>
</sec>
<sec id="s6_5_3"><label>6.5.3</label>
<title>Forecasting time series, exponential smoothing</title>
<p>All of the subsequent adaptive learning-rate algorithms make use of an important forecasting technique known as exponential smoothing of time series. Their authors did not use this terminology, but instead referred to the technique as &#x201C;exponential decaying average&#x201D; [<xref ref-type="bibr" rid="ref-54">54</xref>], [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 300, &#x201C;exponentially decaying average&#x201D; [<xref ref-type="bibr" rid="ref-80">80</xref>], &#x201C;exponential moving average&#x201D; [<xref ref-type="bibr" rid="ref-170">170</xref>], [<xref ref-type="bibr" rid="ref-182">182</xref>], [<xref ref-type="bibr" rid="ref-205">205</xref>], [<xref ref-type="bibr" rid="ref-162">162</xref>], or &#x201C;exponential weight decay&#x201D; [<xref ref-type="bibr" rid="ref-56">56</xref>].</p>
<fig id="fig-69">
<label>Figure 69</label>
<caption><title><italic>Dow Jones Industrial Average</italic> (DJIA, Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref>) stock index year-to-date (YTD) chart from 2019.01.01 to 2019.11.30, Google Finance.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-69.tif"/>
</fig>
<p>&#x201C;Exponential smoothing methods have been around since the 1950s, and are still the most popular forecasting methods used in business and industry&#x201D; such as &#x201C;minute-by-minute stock prices, hourly temperatures at a weather station, daily numbers of arrivals at a medical clinic, weekly sales of a product, monthly unemployment figures for a region, quarterly imports of a country, and annual turnover of a company&#x201D; [<xref ref-type="bibr" rid="ref-206">206</xref>]. See Figure <xref ref-type="fig" rid="fig-69">69</xref> for the chart of a stock index showing noise.</p>
<p>&#x201C;Exponential smoothing was proposed in the late 1950s (Brown, 1959; Holt, 1957; Winters, 1960), and has motivated some of the most successful forecasting methods. Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation the higher the associated weight. This framework generates reliable forecasts quickly and for a wide range of time series, which is a great advantage and of major importance to applications in industry&#x201D; [<xref ref-type="bibr" rid="ref-207">207</xref>], Chap. 7, &#x201C;Exponential smoothing&#x201D;. See Figure <xref ref-type="fig" rid="fig-70">70</xref> for an example of an &#x201C;exponential-smoothing&#x201D; curve that is not &#x201C;smooth&#x201D;.</p>
<fig id="fig-70">
<label>Figure 70</label>
<caption><title><italic>Saudi Arabia oil production during 1996-2013</italic> (Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref>). Piecewise linear data (black) and fitted curve (red), despite the name &#x201C;smoothing&#x201D;. From [<xref ref-type="bibr" rid="ref-207">207</xref>], Chap. 7. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-70.tif"/>
</fig>
<p>For neural networks, early use of exponential smoothing dates back at least to 1998 in [<xref ref-type="bibr" rid="ref-165">165</xref>] and [<xref ref-type="bibr" rid="ref-166">166</xref>].<xref ref-type="fn" rid="fn185"><sup>185</sup></xref><fn id="fn185"><label>185</label><p>We thank Lawrence Aitchison for informing us about these references; see also [<xref ref-type="bibr" rid="ref-168">168</xref>].</p></fn></p>
<p>For adaptive learning-rate algorithms further below (<xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref>, etc.), let <inline-formula id="ieqn-1212"><mml:math id="mml-ieqn-1212"><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:math></inline-formula> be a noisy raw-data time series as in Figure <xref ref-type="fig" rid="fig-69">69</xref>, and <inline-formula id="ieqn-1213"><mml:math id="mml-ieqn-1213"><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:math></inline-formula> its smoothed-out counterpart. 
The following recurrence relation is an exponential smoothing used to predict <inline-formula id="ieqn-1214"><mml:math id="mml-ieqn-1214"><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> based on the known value <inline-formula id="ieqn-1215"><mml:math id="mml-ieqn-1215"><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and the data <inline-formula id="ieqn-1216"><mml:math id="mml-ieqn-1216"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-209"><label>(209)</label><mml:math id="mml-eqn-209" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
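<p>The recurrence in Eq. (<xref ref-type="disp-formula" rid="eqn-209">209</xref>) can be sketched in a few lines of Python. The fragment below is only an illustration: the function name <monospace>exponential_smoothing</monospace>, the default initial value <inline-formula><mml:math><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, and the short data series are assumptions made here, not part of the algorithms discussed in this section.</p>
<code language="python" xml:space="preserve">
```python
from typing import List


def exponential_smoothing(x: List[float], beta: float,
                          s0: float = 0.0) -> List[float]:
    """Return [s(1), ..., s(T)] from s(t+1) = beta * s(t) + (1 - beta) * x(t+1).

    s0 plays the role of s(0); its default of zero is an assumption
    of this sketch.
    """
    if beta >= 1.0 or 0.0 > beta:
        raise ValueError("beta must lie in [0, 1)")
    smoothed = []
    s = s0
    for value in x:
        s = beta * s + (1.0 - beta) * value  # convex combination of s(t) and x(t+1)
        smoothed.append(s)
    return smoothed


# Hypothetical raw data; compare a heavy and a light weighting of the past.
x = [1.0, 3.0, 2.0, 4.0]
smoothed_slow = exponential_smoothing(x, beta=0.9)
smoothed_fast = exponential_smoothing(x, beta=0.5)
print(smoothed_slow)
print(smoothed_fast)
```
</code>
<p>With the value of <inline-formula><mml:math><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> set to zero, the function returns the raw data unchanged, whereas a value close to one makes the smoothed series respond slowly to new data and, with the zero initial value assumed here, stay close to that initial value for the first few steps.</p>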
<p>Eq. (<xref ref-type="disp-formula" rid="eqn-209">209</xref>) is a convex combination of <inline-formula id="ieqn-1217"><mml:math id="mml-ieqn-1217"><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-1218"><mml:math id="mml-ieqn-1218"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. A value of <inline-formula id="ieqn-1219"><mml:math id="mml-ieqn-1219"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> closer to 1, e.g., <inline-formula id="ieqn-1220"><mml:math id="mml-ieqn-1220"><mml:mi>&#x03B2;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.9</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-1221"><mml:math id="mml-ieqn-1221"><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula>, would weigh the smoothed-out past data <inline-formula id="ieqn-1222"><mml:math id="mml-ieqn-1222"><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> more than the future raw data point <inline-formula id="ieqn-1223"><mml:math id="mml-ieqn-1223"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. From Eq. (<xref ref-type="disp-formula" rid="eqn-209">209</xref>), we have</p>
<p><disp-formula id="eqn-210"><label>(210)</label><mml:math id="mml-eqn-210" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-211"><label>(211)</label><mml:math id="mml-eqn-211" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="0em 0.3em" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x22EE;</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-212"><label>(212)</label><mml:math id="mml-eqn-212" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msup><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the first term in Eq. (<xref ref-type="disp-formula" rid="eqn-212">212</xref>) is called the bias, which is set by the initial condition:</p>
<p><disp-formula id="eqn-213"><label>(213)</label><mml:math id="mml-eqn-213" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;and as&#x00A0;</mml:mtext></mml:mrow><mml:mi>t</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>For finite time <inline-formula id="ieqn-1224"><mml:math id="mml-ieqn-1224"><mml:mi>t</mml:mi></mml:math></inline-formula>, if the series started with <inline-formula id="ieqn-1225"><mml:math id="mml-ieqn-1225"><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x2260;</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, then there is a need to correct the estimate for the non-zero bias <inline-formula id="ieqn-1226"><mml:math id="mml-ieqn-1226"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2260;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> (see Eq. (<xref ref-type="disp-formula" rid="eqn-227">227</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-229">229</xref>) in the Adam algorithm below). The coefficients in the series in Eq. (<xref ref-type="disp-formula" rid="eqn-212">212</xref>) are exponential functions, and thus the name &#x201C;exponential smoothing&#x201D;.</p>
<p>Eq. (<xref ref-type="disp-formula" rid="eqn-212">212</xref>) is the discrete counterpart of the linear part of the Volterra series in Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>), used widely in neuroscientific modeling; see <xref ref-type="statement" rid="st13_2">Remark 13.2</xref>. See also the &#x201C;small heavy sphere&#x201D; method, or SGD with momentum, Eq. (<xref ref-type="disp-formula" rid="eqn-145">145</xref>).</p>
<p>It should be noted, however, that for <italic>forecasting</italic> (e.g., [<xref ref-type="bibr" rid="ref-207">207</xref>]), the following recursive equation, slightly different from Eq. (<xref ref-type="disp-formula" rid="eqn-209">209</xref>), is used instead:</p>
<p><disp-formula id="eqn-214"><label>(214)</label><mml:math id="mml-eqn-214" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mstyle mathcolor="purple"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1227"><mml:math id="mml-ieqn-1227"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (shown in purple) is used instead of <inline-formula id="ieqn-1228"><mml:math id="mml-ieqn-1228"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, since if the data at <inline-formula id="ieqn-1229"><mml:math id="mml-ieqn-1229"><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> were already known, there would be no need to forecast.</p>
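<p>As a minimal numerical sketch, the Python fragment below compares the recurrence in Eq. (<xref ref-type="disp-formula" rid="eqn-209">209</xref>) with its expanded form in Eq. (<xref ref-type="disp-formula" rid="eqn-212">212</xref>). The function names, the sample data, and the value beta = 0.9 are illustrative assumptions. The division by 1 &#x2212; beta^t for the case s(0) = 0 is also an assumption: it is the standard bias correction of the kind used in the Adam algorithm, and it is exact here only because the sample data are constant.</p>
<preformat>
```python
def smooth_recursive(x, beta, s0):
    """Apply s(t+1) = beta*s(t) + (1-beta)*x(t+1), starting from s(0) = s0.

    The entry x[0] is not used by the recurrence itself.
    """
    s = [s0]
    for t in range(1, len(x)):
        s.append(beta * s[-1] + (1.0 - beta) * x[t])
    return s


def smooth_closed_form(x, beta, s0, t):
    """Evaluate the expanded series: bias term plus exponentially weighted sum."""
    bias = beta ** t * s0
    series = sum(beta ** (t - i) * x[i] for i in range(1, t + 1))
    return bias + (1.0 - beta) * series


# Hypothetical constant raw data, chosen so that the exact answer is known.
x = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
beta = 0.9
t = len(x) - 1

s_rec = smooth_recursive(x, beta, s0=x[0])    # initial condition s(0) = x(0)
closed = smooth_closed_form(x, beta, x[0], t)  # same quantity from the series

# Starting from s(0) = 0 biases the estimate toward zero at finite t;
# dividing by (1 - beta**t) removes that bias for constant data (assumption).
s_zero = smooth_recursive(x, beta, s0=0.0)
corrected = s_zero[t] / (1.0 - beta ** t)

print(s_rec[t], closed, s_zero[t], corrected)
```
</preformat>
<p>For the constant data above, the recursive value, the series value, and the bias-corrected value all agree with the raw value up to round-off, whereas the uncorrected estimate started from s(0) = 0 falls short of it at finite t.</p>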
</sec>
<sec id="s6_5_4"><label>6.5.4</label>
<title>RMSProp: Root Mean Square Propagation</title>
<p>Since &#x201C;<xref ref-type="sec" rid="s6_5_2">AdaGrad</xref> shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure&#x201D;,<xref ref-type="fn" rid="fn186"><sup>186</sup></xref><fn id="fn186"><label>186</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 300.</p></fn> the authors of [<xref ref-type="bibr" rid="ref-53">53</xref>]<xref ref-type="fn" rid="fn187"><sup>187</sup></xref><fn id="fn187"><label>187</label><p>Almost all authors, e.g., [<xref ref-type="bibr" rid="ref-80">80</xref>] [<xref ref-type="bibr" rid="ref-55">55</xref>] [<xref ref-type="bibr" rid="ref-162">162</xref>], attributed RMSProp to Ref. [<xref ref-type="bibr" rid="ref-53">53</xref>], except for Ref. [<xref ref-type="bibr" rid="ref-78">78</xref>], where only Hinton&#x2019;s 2012 Coursera lecture was referred to. Tieleman was Hinton&#x2019;s student; see the video and the lecture notes in [<xref ref-type="bibr" rid="ref-53">53</xref>], where Tieleman&#x2019;s contribution was noted as unpublished. The authors of [<xref ref-type="bibr" rid="ref-162">162</xref>] indicated that both RMSProp and AdaDelta (next section) were developed independently at about the same time to fix problems in AdaGrad.</p></fn> fixed the problem of continuing decay of the learning rate by introducing RMSProp<xref ref-type="fn" rid="fn188"><sup>188</sup></xref><fn id="fn188"><label>188</label><p>Neither [<xref ref-type="bibr" rid="ref-162">162</xref>], nor [<xref ref-type="bibr" rid="ref-80">80</xref>], nor [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 299, provided the meaning of the acronym RMSProp, which stands for &#x201C;Root Mean Square Propagation&#x201D;.</p></fn> with the following functions for Algorithm <xref ref-type="fig" rid="fig-163">5</xref>:</p>
<p><disp-formula id="eqn-215"><label>(215)</label><mml:math id="mml-eqn-215" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mrow><mml:mtext>&#x00A0;(Identity)</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-216"><label>(216)</label><mml:math id="mml-eqn-216" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#x00A0;(element-wise square)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-217"><label>(217)</label><mml:math id="mml-eqn-217" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-218"><label>(218)</label><mml:math id="mml-eqn-218" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>6</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msqrt><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:msqrt></mml:mfrac><mml:mrow><mml:mtext>&#x00A0;(element-wise operations)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the running average of the squared gradients is given in Eq. (<xref ref-type="disp-formula" rid="eqn-216">216</xref>) for efficient coding, and in Eq. (<xref ref-type="disp-formula" rid="eqn-217">217</xref>) in fully expanded form as a series with exponential coefficients <inline-formula id="ieqn-1230"><mml:math id="mml-ieqn-1230"><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, for <inline-formula id="ieqn-1231"><mml:math id="mml-ieqn-1231"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>. Eq. (<xref ref-type="disp-formula" rid="eqn-216">216</xref>) is the exact counterpart of the exponential-smoothing recurrence relation in Eq. (<xref ref-type="disp-formula" rid="eqn-209">209</xref>), and Eq. (<xref ref-type="disp-formula" rid="eqn-217">217</xref>) has its counterpart in Eq. (<xref ref-type="disp-formula" rid="eqn-212">212</xref>) if <inline-formula id="ieqn-1232"><mml:math id="mml-ieqn-1232"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>0</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>; see Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref> on forecasting time series and exponential smoothing.</p>
<p>Figure <xref ref-type="fig" rid="fig-68">68</xref> shows the convergence of some adaptive learning-rate algorithms: <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_3_2">SGDNesterov</xref>, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref>.</p>
<p><xref ref-type="sec" rid="s6_5_4">RMSProp</xref> still depends on a constant global learning rate <inline-formula id="ieqn-1233"><mml:math id="mml-ieqn-1233"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>, a hyperparameter that must be tuned. Even though <xref ref-type="sec" rid="s6_5_4">RMSProp</xref> was one of the go-to algorithms for machine learning, the pitfalls of <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, along with those of other adaptive learning-rate algorithms, were revealed in [<xref ref-type="bibr" rid="ref-55">55</xref>].</p>
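<p>The RMSProp update defined by Eqs. (<xref ref-type="disp-formula" rid="eqn-215">215</xref>)-(<xref ref-type="disp-formula" rid="eqn-218">218</xref>) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the quadratic test cost, the constant learning rate eps0 = 0.01, and the value beta = 0.9 are hypothetical choices, whereas delta = 1e-6 follows Eq. (<xref ref-type="disp-formula" rid="eqn-218">218</xref>). The descent step, parameters minus learning rate times gradient, is assumed to be the generic update of the unified framework.</p>
<preformat>
```python
import math


def rmsprop_step(theta, grad, v, eps0=0.01, beta=0.9, delta=1e-6):
    """One element-wise RMSProp update (a sketch, not a reference implementation)."""
    # Running average of the squared gradients (exponential smoothing).
    v_new = [beta * vi + (1.0 - beta) * gi * gi for vi, gi in zip(v, grad)]
    # Per-parameter learning rate: eps0 / sqrt(V + delta).
    rates = [eps0 / math.sqrt(vi + delta) for vi in v_new]
    # Descent step with the raw gradient as first moment (assumed generic update).
    theta_new = [ti - ri * gi for ti, ri, gi in zip(theta, rates, grad)]
    return theta_new, v_new


def grad_quadratic(theta):
    """Gradient of the hypothetical cost 0.5 * (100*a**2 + b**2)."""
    return [100.0 * theta[0], theta[1]]


theta = [1.0, 1.0]  # initial parameters (illustrative)
v = [0.0, 0.0]      # running average initialized to zero
for _ in range(500):
    theta, v = rmsprop_step(theta, grad_quadratic(theta), v)

print(theta)
```
</preformat>
<p>Because each step is scaled by the root of the running average, both parameters move at a similar pace even though their gradients differ by a factor of 100. With a constant eps0, the iterates end up oscillating in a small neighborhood of the minimizer rather than settling on it exactly.</p>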
</sec>
<sec id="s6_5_5"><label>6.5.5</label>
<title>AdaDelta: Adaptive Delta (parameter increment)</title>
<p>The name &#x201C;AdaDelta&#x201D; comes from the adaptive parameter increment <inline-formula id="ieqn-1234"><mml:math id="mml-ieqn-1234"><mml:mtext>&#x0394;</mml:mtext><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-223">223</xref>). In parallel and independently, AdaDelta, proposed in [<xref ref-type="bibr" rid="ref-54">54</xref>], not only fixed the problem of the continually decaying learning rate of <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, but also removed the need for a global learning rate <inline-formula id="ieqn-1235"><mml:math id="mml-ieqn-1235"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>, which <xref ref-type="sec" rid="s6_5_4">RMSProp</xref> still used. By accumulating the squares of the parameter increments, i.e., <inline-formula id="ieqn-1236"><mml:math id="mml-ieqn-1236"><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, AdaDelta would fit in the unified framework of Algorithm <xref ref-type="fig" rid="fig-163">5</xref> if the symbol <inline-formula id="ieqn-1237"><mml:math id="mml-ieqn-1237"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">m</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-197">197</xref>) were interpreted as the accumulated 2nd moment <inline-formula id="ieqn-1238"><mml:math id="mml-ieqn-1238"><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, per <xref ref-type="statement" rid="st6_11">Remark 6.11</xref>.</p>
<p>The weaknesses of <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref> were observed in [<xref ref-type="bibr" rid="ref-54">54</xref>]: &#x201C;Since the magnitudes of gradients are factored out in <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, this method can be sensitive to initial conditions of the parameters and the corresponding gradients. If the initial gradients are large, the learning rates will be low for the remainder of training. This can be combatted by increasing the global learning rate, making the <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref> method sensitive to the choice of learning rate. Also, due to the continual accumulation of squared gradients in the denominator, the learning rate will continue to decrease throughout training, eventually decreasing to zero and stopping training completely.&#x201D;</p>
<p>AdaDelta was then introduced in [<xref ref-type="bibr" rid="ref-54">54</xref>] as an improvement over AdaGrad with two goals in mind: (1) to avoid the continuing decay of the learning rate, and (2) to avoid having to specify <inline-formula id="ieqn-1239"><mml:math id="mml-ieqn-1239"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, called the &#x201C;global learning rate&#x201D;, as a constant. Instead of summing past squared gradients over a finite-size window, which is not efficient in coding, exponential smoothing was employed in [<xref ref-type="bibr" rid="ref-54">54</xref>] for both the squared gradients <inline-formula id="ieqn-1240"><mml:math id="mml-ieqn-1240"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula> and for the squared increments <inline-formula id="ieqn-1241"><mml:math id="mml-ieqn-1241"><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, with the increment used in the update <inline-formula id="ieqn-1242"><mml:math id="mml-ieqn-1242"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, by choosing the following functions for Algorithm <xref ref-type="fig" rid="fig-163">5</xref>:</p>
<p><disp-formula id="eqn-219"><label>(219)</label><mml:math id="mml-eqn-219" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#x00A0;(element-wise square)</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-220"><label>(220)</label><mml:math id="mml-eqn-220" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-221"><label>(221)</label><mml:math id="mml-eqn-221" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mtext>&#x00A0;(element-wise square)</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-222"><label>(222)</label><mml:math id="mml-eqn-222" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Thus, exponential smoothing (Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref>) is used for two second-moment series: <inline-formula id="ieqn-1243"><mml:math id="mml-ieqn-1243"><mml:mrow><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow class="MJX-TeXAtom-ORD"><mml:mover><mml:mi>g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-1244"><mml:math id="mml-ieqn-1244"><mml:mrow><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mrow></mml:math></inline-formula>. The update of the network parameters from <inline-formula id="ieqn-1245"><mml:math id="mml-ieqn-1245"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to <inline-formula id="ieqn-1246"><mml:math id="mml-ieqn-1246"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is carried out as follows:</p>
<p><disp-formula id="eqn-223"><label>(223)</label><mml:math id="mml-eqn-223" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:msqrt><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:msqrt><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msqrt><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:msqrt><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1247"><mml:math id="mml-ieqn-1247"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-218">218</xref>) is fixed to 1 in Eq. (<xref ref-type="disp-formula" rid="eqn-223">223</xref>), eliminating the hyperparameter <inline-formula id="ieqn-1248"><mml:math id="mml-ieqn-1248"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. Another nice feature of AdaDelta is the consistency of units (physical dimensions), in the sense that the fraction factor of the gradient <inline-formula id="ieqn-1249"><mml:math id="mml-ieqn-1249"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-223">223</xref>) has the unit of step length (learning rate):<xref ref-type="fn" rid="fn189"><sup>189</sup></xref><fn id="fn189"><label>189</label><p>In spite of the nice features in <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, neither [<xref ref-type="bibr" rid="ref-78">78</xref>], nor [<xref ref-type="bibr" rid="ref-80">80</xref>], nor [<xref ref-type="bibr" rid="ref-162">162</xref>], had a review of <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, except for citing [<xref ref-type="bibr" rid="ref-54">54</xref>], even though the authors of [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 
302, wrote: &#x201C;While the results suggest that the family of algorithms with adaptive learning rates (represented by <xref ref-type="sec" rid="s6_5_4">RMSProp</xref> and <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>) performed fairly robustly, no single best algorithm has emerged,&#x201D; and &#x201C;Currently, the most popular optimization algorithms actively in use include <xref ref-type="sec" rid="s6_3_1">SGD</xref>, <xref ref-type="sec" rid="s6_3_2">SGD with momentum</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref> with momentum, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, and Adam.&#x201D; The authors of [<xref ref-type="bibr" rid="ref-80">80</xref>], p. 286, did not follow the historical development, briefly reviewed <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, then cited in passing references for <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref> and <xref ref-type="sec" rid="s6_5_6">Adam</xref>, and then mentioned the &#x201C;popular <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref> algorithm&#x201D; as a &#x201C;member of this family&#x201D;; readers would lose sight of the gradual progress made starting from <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, to <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, then <xref ref-type="sec" rid="s6_5_6">Adam</xref>, and to the recent <xref ref-type="sec" rid="s6_5_10">AdamW</xref> in [<xref ref-type="bibr" rid="ref-56">56</xref>], among others, alternating with pitfalls revealed and subsequent fixes, and then more pitfalls revealed and more fixes.</p></fn></p>
<p><disp-formula id="eqn-224"><label>(224)</label><mml:math id="mml-eqn-224" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:msqrt><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:msqrt><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:mrow><mml:mrow><mml:msqrt><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:msqrt><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the enclosing square brackets denote units (physical dimensions), but that was not the case in Eq. (<xref ref-type="disp-formula" rid="eqn-218">218</xref>) of RMSProp:</p>
<p><disp-formula id="eqn-225"><label>(225)</label><mml:math id="mml-eqn-225" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msqrt><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:msqrt></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2260;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Figure <xref ref-type="fig" rid="fig-68">68</xref> shows the convergence of some adaptive learning-rate algorithms: <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_3_2">SGDNesterov</xref>, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref>.</p>
<p>Despite this progress, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref> and <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, along with other adaptive learning-rate algorithms, shared the same pitfalls as revealed in [<xref ref-type="bibr" rid="ref-55">55</xref>].</p>
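<p>The update in Eq. (<xref ref-type="disp-formula" rid="eqn-223">223</xref>) can be illustrated with a minimal Python sketch. It assumes a smoothing coefficient <monospace>beta</monospace> = 0.9 and a constant <monospace>delta</monospace> = 1e-6 for both second-moment series, which are illustrative values not prescribed in the text, and both series are assumed to start from zero. The function and variable names are hypothetical, and the test function is a simple quadratic chosen only for demonstration.</p>
<code language="python">
```python
import math


def adadelta_step(theta, grad, v, m, beta=0.9, delta=1e-6):
    """One AdaDelta update in the form of Eq. (223), element-wise.

    v: smoothed series of squared gradients (V-hat in the text)
    m: smoothed series of squared parameter increments (m_{k-1} on entry)
    beta and delta are assumed, illustrative values.
    """
    new_theta, new_v, new_m = [], [], []
    for th, g, v_old, m_old in zip(theta, grad, v, m):
        # Exponential smoothing of the squared gradient.
        v_k = beta * v_old + (1.0 - beta) * g * g
        # Step-length factor: a ratio with the units of (increment / gradient),
        # so no separate learning-rate hyperparameter appears.
        step = (math.sqrt(m_old) + delta) / (math.sqrt(v_k) + delta)
        d_theta = -step * g
        # Exponential smoothing of the squared parameter increment.
        m_k = beta * m_old + (1.0 - beta) * d_theta * d_theta
        new_theta.append(th + d_theta)
        new_v.append(v_k)
        new_m.append(m_k)
    return new_theta, new_v, new_m


# Usage: f(theta) = 0.5 * theta**2, whose gradient equals theta itself.
theta, v, m = [1.0], [0.0], [0.0]
for _ in range(200):
    theta, v, m = adadelta_step(theta, list(theta), v, m)
```
</code>
<p>Because both series start from zero, the first increments in this sketch are of the order of <monospace>delta</monospace>, and the step length grows only gradually as the smoothed squared increments accumulate.</p>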
</sec>
<sec id="s6_5_6"><label>6.5.6</label>
<title>Adam: Adaptive moments</title>
<p>In Adam, both the 1st moment, Eq. (<xref ref-type="disp-formula" rid="eqn-226">226</xref>), and the 2nd moment, Eq. (<xref ref-type="disp-formula" rid="eqn-228">228</xref>), are adaptive. To avoid the possibly large step sizes and non-convergence of <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, the following functions were selected for Algorithm <xref ref-type="fig" rid="fig-163">5</xref> [<xref ref-type="bibr" rid="ref-170">170</xref>]:</p>
<p><disp-formula id="eqn-226"><label>(226)</label><mml:math id="mml-eqn-226" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-227"><label>(227)</label><mml:math id="mml-eqn-227" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>k</mml:mi></mml:msup></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;(bias correction)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-228"><label>(228)</label><mml:math id="mml-eqn-228" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mtext>&#x00A0;(element-wise square)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-229"><label>(229)</label><mml:math id="mml-eqn-229" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>k</mml:mi></mml:msup></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;(bias correction)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-230"><label>(230)</label><mml:math id="mml-eqn-230" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msqrt><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:msqrt><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:mtext>&#x00A0;(element-wise operations)</mml:mtext></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The recommended values of the parameters are:</p>
<p><disp-formula id="eqn-231"><label>(231)</label><mml:math id="mml-eqn-231" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.9</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.999</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.001</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>8</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
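<p>As a minimal sketch, Eqs. (<xref ref-type="disp-formula" rid="eqn-226">226</xref>) to (<xref ref-type="disp-formula" rid="eqn-230">230</xref>) with the values in Eq. (<xref ref-type="disp-formula" rid="eqn-231">231</xref>) can be written in plain Python as below. Two assumptions are made: the learning-rate schedule is taken as constant, equal to the value 0.001 in Eq. (<xref ref-type="disp-formula" rid="eqn-231">231</xref>), and the parameter update is taken in the standard Adam form, in which the bias-corrected 1st moment is scaled by the element-wise learning rate of Eq. (<xref ref-type="disp-formula" rid="eqn-230">230</xref>); Algorithm <xref ref-type="fig" rid="fig-163">5</xref> is not reproduced here. The function name <monospace>adam_step</monospace> is hypothetical.</p>
<code language="python">
```python
import math


def adam_step(theta, grad, m, v, k, eps=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam update, element-wise; the iteration counter k starts at 1."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_old, v_old in zip(theta, grad, m, v):
        m_k = beta1 * m_old + (1.0 - beta1) * g        # Eq. (226)
        v_k = beta2 * v_old + (1.0 - beta2) * g * g    # Eq. (228)
        m_hat = m_k / (1.0 - beta1 ** k)               # Eq. (227), bias correction
        v_hat = v_k / (1.0 - beta2 ** k)               # Eq. (229), bias correction
        rate = eps / (math.sqrt(v_hat) + delta)        # Eq. (230), constant eps assumed
        new_theta.append(th - rate * m_hat)            # assumed standard Adam update
        new_m.append(m_k)
        new_v.append(v_k)
    return new_theta, new_m, new_v


# Usage: first step on f(theta) = 0.5 * theta**2 from theta = 1 (gradient = 1).
theta1, m1, v1 = adam_step([1.0], [1.0], [0.0], [0.0], k=1)
```
</code>
<p>In this first step both bias-corrected moments equal the gradient and its square, so the increment has a magnitude close to the learning rate 0.001, regardless of the scale of the gradient.</p>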
 <statement id="st6_13"><title><xref ref-type="statement" rid="st6_13">Remark 6.13</xref>.</title>
<p><xref ref-type="sec" rid="s6_5_4">RMSProp</xref> is a particular case of Adam, when <inline-formula id="ieqn-1250"><mml:math id="mml-ieqn-1250"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, together with the absence of the bias-corrected 1st moment Eq. (<xref ref-type="disp-formula" rid="eqn-227">227</xref>) and of the bias-corrected 2nd moment Eq. (<xref ref-type="disp-formula" rid="eqn-229">229</xref>). Moreover, to get <xref ref-type="sec" rid="s6_5_4">RMSProp</xref> from Adam, choose the constant <inline-formula id="ieqn-1251"><mml:math id="mml-ieqn-1251"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> and the learning rate <inline-formula id="ieqn-1252"><mml:math id="mml-ieqn-1252"><mml:msub><mml:mrow><mml:mi>&#x03F5;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> as in Eq. (<xref ref-type="disp-formula" rid="eqn-218">218</xref>), instead of Eq. (<xref ref-type="disp-formula" rid="eqn-230">230</xref>) above; this is a minor point, since either choice should be fine. On the other hand, for deep-learning applications, having the 1st moment (or momentum), and thus requiring <inline-formula id="ieqn-1253"><mml:math id="mml-ieqn-1253"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, would be useful to &#x201C;significantly boost the performance&#x201D; [<xref ref-type="bibr" rid="ref-182">182</xref>], which is hence an advantage of Adam over <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>It follows from Eq. (<xref ref-type="disp-formula" rid="eqn-212">212</xref>) in Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref> on exponential smoothing of time series that the recurrence relation for gradients (1st moment) in Eq. (<xref ref-type="disp-formula" rid="eqn-226">226</xref>) leads to the following series:</p>
<p><disp-formula id="eqn-232"><label>(232)</label><mml:math id="mml-eqn-232" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>since <inline-formula id="ieqn-1254"><mml:math id="mml-ieqn-1254"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">m</mml:mi></mml:mrow><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>. Taking the expectation, as defined in Eq. (<xref ref-type="disp-formula" rid="eqn-67">67</xref>), on both sides of Eq. (<xref ref-type="disp-formula" rid="eqn-232">232</xref>) yields</p>
<p><disp-formula id="eqn-233"><label>(233)</label><mml:math id="mml-eqn-233" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>k</mml:mi></mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1255"><mml:math id="mml-ieqn-1255"><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow></mml:math></inline-formula> is the drift from the expected value, with <inline-formula id="ieqn-1256"><mml:math id="mml-ieqn-1256"><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> for stationary random processes.<xref ref-type="fn" rid="fn190"><sup>190</sup></xref><fn id="fn190"><label>190</label><p>A random process is stationary when its mean and standard deviation stay constant over time.</p></fn> For non-stationary processes, it was suggested in [<xref ref-type="bibr" rid="ref-170">170</xref>] to keep <inline-formula id="ieqn-1257"><mml:math id="mml-ieqn-1257"><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow></mml:math></inline-formula> small by choosing a small <inline-formula id="ieqn-1258"><mml:math id="mml-ieqn-1258"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, so that only past gradients close to the present iteration <inline-formula id="ieqn-1259"><mml:math id="mml-ieqn-1259"><mml:mi>k</mml:mi></mml:math></inline-formula> contribute, which keeps any change in the mean and standard deviation over subsequent iterations small. The last equality in Eq. (<xref ref-type="disp-formula" rid="eqn-233">233</xref>) uses the sum of the finite geometric series. Dividing both sides by <inline-formula id="ieqn-1260"><mml:math id="mml-ieqn-1260"><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>k</mml:mi></mml:msup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> yields the bias-corrected 1st moment <inline-formula id="ieqn-1261"><mml:math id="mml-ieqn-1261"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> shown in Eq. (<xref ref-type="disp-formula" rid="eqn-227">227</xref>), and shows that the expected value of <inline-formula id="ieqn-1262"><mml:math id="mml-ieqn-1262"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> equals the expected value of the gradient <inline-formula id="ieqn-1263"><mml:math id="mml-ieqn-1263"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> plus a small number, which vanishes for stationary processes:</p>
<p><disp-formula id="eqn-234"><label>(234)</label><mml:math id="mml-eqn-234" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>k</mml:mi></mml:msup></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>k</mml:mi></mml:msup></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The argument to obtain the bias-corrected 2nd moment <inline-formula id="ieqn-1264"><mml:math id="mml-ieqn-1264"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-229">229</xref>) is of course the same.</p>
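<p>The bias-correction argument can be checked numerically in the simplest stationary case. The sketch below assumes a constant scalar gradient (so that the drift is zero) and a hypothetical helper <monospace>smoothed_first_moment</monospace>; it applies the recurrence of Eq. (<xref ref-type="disp-formula" rid="eqn-226">226</xref>) starting from zero and then divides by the factor in Eq. (<xref ref-type="disp-formula" rid="eqn-227">227</xref>).</p>
<code language="python">
```python
def smoothed_first_moment(grads, beta1):
    """Return the list m_1..m_k from the recurrence of Eq. (226), with m_0 = 0."""
    m, history = 0.0, []
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
        history.append(m)
    return history


beta1 = 0.9
g_const = 2.0  # assumed constant gradient: stationary case, zero drift
ms = smoothed_first_moment([g_const] * 10, beta1)

# Uncorrected m_k underestimates the gradient by the factor 1 - beta1**k.
biased = [m / g_const for m in ms]

# Dividing by that factor, as in Eq. (227), recovers the gradient at every k.
corrected = [m / (1.0 - beta1 ** (k + 1)) for k, m in enumerate(ms)]
```
</code>
<p>With these values, the uncorrected first moment at the first iteration is only one tenth of the gradient, which illustrates why the correction matters most in the early iterations.</p>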
<p>The authors of [<xref ref-type="bibr" rid="ref-170">170</xref>] pointed out the lack of bias correction in <xref ref-type="sec" rid="s6_5_4">RMSProp</xref> (<xref ref-type="statement" rid="st6_13">Remark 6.13</xref>), leading to &#x201C;very large step sizes and often divergence&#x201D;, and provided numerical experiment results to support their point.</p>
<p>Figure <xref ref-type="fig" rid="fig-68">68</xref> shows the convergence of some adaptive learning-rate algorithms: <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_3_2">SGDNesterov</xref>, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref>. Their results show the superior performance of <xref ref-type="sec" rid="s6_5_6">Adam</xref> compared to other adaptive learning-rate algorithms. See Figure <xref ref-type="fig" rid="fig-151">151</xref> in Section <xref ref-type="sec" rid="s14_7">14.7</xref> on &#x201C;Lack of transparency and irreproducibility of results&#x201D; in recent deep-learning papers.</p>

</sec>
<sec id="s6_5_7"><label>6.5.7</label>
<title>AMSGrad: Adaptive Moment Smoothed Gradient</title>
<p>The authors of [<xref ref-type="bibr" rid="ref-182">182</xref>] stated that <xref ref-type="sec" rid="s6_5_6">Adam</xref> (and other variants such as <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, Nadam) &#x201C;failed to converge to an optimal solution (or a critical point in non-convex settings)&#x201D; in many applications with large output spaces, and constructed a simple convex optimization problem for which <xref ref-type="sec" rid="s6_5_6">Adam</xref> did not converge to the optimal solution.</p>
<p>An earlier version of [<xref ref-type="bibr" rid="ref-182">182</xref>] received one of the three Best Paper awards at the ICLR 2018<xref ref-type="fn" rid="fn191"><sup>191</sup></xref><fn id="fn191"><label>191</label><p>Sixth International Conference on Learning Representations (<ext-link ext-link-type="uri" xlink:href="https://iclr.cc/Conferences/2018">Website</ext-link>).</p></fn> conference. That paper suggested fixing the problem by endowing the mentioned algorithms with &#x201C;long-term memory&#x201D; of past gradients, and by selecting the following functions for Algorithm <xref ref-type="fig" rid="fig-163">5</xref>:</p>
<p><disp-formula id="eqn-235"><label>(235)</label><mml:math id="mml-eqn-235" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-236"><label>(236)</label><mml:math id="mml-eqn-236" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;or&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;or&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub><mml:mi>k</mml:mi></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-237"><label>(237)</label><mml:math id="mml-eqn-237" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;(no bias correction)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-238"><label>(238)</label><mml:math id="mml-eqn-238" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;(element-wise square)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-239"><label>(239)</label><mml:math id="mml-eqn-239" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mfrac><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub><mml:msqrt><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:msqrt></mml:mfrac><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">&#x21D4;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x003E;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-240"><label>(240)</label><mml:math id="mml-eqn-240" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#x00A0;(element-wise max, &#x201C;long-term memory&#x201D;)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mtext>&#x00A0;not used</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The parameter <inline-formula id="ieqn-1265"><mml:math id="mml-ieqn-1265"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> was not defined in Corollary 1 of [<xref ref-type="bibr" rid="ref-182">182</xref>]; this omission could create some difficulty for first-time learners. It has to be deduced from reading the Corollary that <inline-formula id="ieqn-1266"><mml:math id="mml-ieqn-1266"><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. For the step-length (or size) schedule <inline-formula id="ieqn-1267"><mml:math id="mml-ieqn-1267"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, even though only Eq. (<xref ref-type="disp-formula" rid="eqn-149">149</xref>) was considered in [<xref ref-type="bibr" rid="ref-182">182</xref>] for the convergence proofs, Eq. (<xref ref-type="disp-formula" rid="eqn-147">147</xref>) (which includes <inline-formula id="ieqn-1268"><mml:math id="mml-ieqn-1268"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo></mml:math></inline-formula> constant) and Eq. (<xref ref-type="disp-formula" rid="eqn-150">150</xref>) could also be used.<xref ref-type="fn" rid="fn192"><sup>192</sup></xref><fn id="fn192"><label>192</label><p>The authors of [<xref ref-type="bibr" rid="ref-182">182</xref>] distinguished the step size <inline-formula id="ieqn-3252"><mml:math id="mml-ieqn-3252"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> from the scaled step size <inline-formula id="ieqn-3253"><mml:math id="mml-ieqn-3253"><mml:msub><mml:mrow><mml:mi mathvariant='bold-italic'>&#x03F5;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-147">147</xref>) or Eq. (<xref ref-type="disp-formula" rid="eqn-149">149</xref>), which were called learning rate.</p></fn></p>
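<p>As a minimal sketch of how Eqs. (<xref ref-type="disp-formula" rid="eqn-235">235</xref>) to (<xref ref-type="disp-formula" rid="eqn-240">240</xref>) fit together, the Python code below implements one AMSGrad iteration and a driver loop that uses the second choice in Eq. (<xref ref-type="disp-formula" rid="eqn-236">236</xref>) together with the step length in Eq. (<xref ref-type="disp-formula" rid="eqn-149">149</xref>). Several items are assumptions made for illustration only and are not taken from [<xref ref-type="bibr" rid="ref-182">182</xref>]: the function names <monospace>amsgrad_step</monospace> and <monospace>minimize</monospace>, the default hyperparameter values, the parameter update (step length times first moment divided by the square root of the long-term second moment, i.e., the usual AMSGrad form, which is not restated in this section), and the guard against a zero second moment, which is introduced here because &#x03B4; is not used.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def amsgrad_step(theta, grad, m, V, V_hat, beta1k, beta2, eps_k):
    """One AMSGrad iteration; returns the updated (theta, m, V, V_hat)."""
    m = beta1k * m + (1.0 - beta1k) * grad       # Eq. (235); no bias correction, Eq. (237)
    V = beta2 * V + (1.0 - beta2) * grad**2      # Eq. (238), element-wise square
    V_hat = np.maximum(V_hat, V)                 # Eq. (240), element-wise max
    # Assumption (delta not used): replace zero entries of V_hat by 1. Where V_hat
    # is zero, all past gradient components are zero, so m is zero there as well.
    safe = np.where(V_hat > 0.0, V_hat, 1.0)
    theta = theta - eps_k * m / np.sqrt(safe)    # assumed standard AMSGrad update
    return theta, m, V, V_hat


def minimize(grad_fn, theta0, k_max=2000, beta11=0.9, lam=0.99, beta2=0.999, eps0=0.1):
    """Run k_max AMSGrad iterations with illustrative (hypothetical) defaults."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                     # m_0 = 0
    V = np.zeros_like(theta)                     # V_0 = 0
    V_hat = np.zeros_like(theta)
    for k in range(1, k_max + 1):
        beta1k = beta11 * lam ** (k - 1)         # second choice in Eq. (236)
        eps_k = eps0 / np.sqrt(k)                # step length eps0 / sqrt(k)
        theta, m, V, V_hat = amsgrad_step(
            theta, grad_fn(theta), m, V, V_hat, beta1k, beta2, eps_k
        )
    return theta
```
]]></code>
<p>Under these assumptions, calling <monospace>minimize</monospace> with the gradient of a simple quadratic cost drives the parameters toward the minimizer, and setting <monospace>lam=1.0</monospace> gives the constant first-moment parameter that is typically used in practice.</p>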
<p>First-time learners in this field could be overwhelmed by complex-looking equations in this kind of paper, so it would be helpful to elucidate some key results that led to the above expressions, particularly for <inline-formula id="ieqn-1269"><mml:math id="mml-ieqn-1269"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, which can be a constant or a function of the iteration number <inline-formula id="ieqn-1270"><mml:math id="mml-ieqn-1270"><mml:mi>k</mml:mi></mml:math></inline-formula>, in Eq. (<xref ref-type="disp-formula" rid="eqn-236">236</xref>).</p>
<p>It was stated in [<xref ref-type="bibr" rid="ref-182">182</xref>] that &#x201C;one typically uses a constant <inline-formula id="ieqn-1271"><mml:math id="mml-ieqn-1271"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in practice (although, the proof requires a decreasing schedule for proving convergence of the algorithm),&#x201D; and hence the first choice <inline-formula id="ieqn-1272"><mml:math id="mml-ieqn-1272"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
<p>The second choice <inline-formula id="ieqn-1273"><mml:math id="mml-ieqn-1273"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-1274"><mml:math id="mml-ieqn-1274"><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-1275"><mml:math id="mml-ieqn-1275"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msqrt><mml:mi>k</mml:mi></mml:msqrt></mml:math></inline-formula> as in Eq. (<xref ref-type="disp-formula" rid="eqn-149">149</xref>), was the result stated in Corollary 1 in [<xref ref-type="bibr" rid="ref-182">182</xref>], but without proof. We fill this gap here to explain this unusual expression for <inline-formula id="ieqn-1276"><mml:math id="mml-ieqn-1276"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
Only the second term on the right-hand side of the inequality in Theorem 4 needs to be bounded using this choice of <inline-formula id="ieqn-1277"><mml:math id="mml-ieqn-1277"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>; in our notation, this term is written as:<xref ref-type="fn" rid="fn193"><sup>193</sup></xref><fn id="fn193"><label>193</label><p>To write this term in the notation used in [<xref ref-type="bibr" rid="ref-182">182</xref>], Theorem 4 and Corollary 1, simply make the following changes in notation: <inline-formula id="ieqn-3254"><mml:math id="mml-ieqn-3254"><mml:mi>k</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>t</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-3255"><mml:math id="mml-ieqn-3255"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>T</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-3256"><mml:math id="mml-ieqn-3256"><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>d</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-3257"><mml:math id="mml-ieqn-3257"><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>v</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-3258"><mml:math id="mml-ieqn-3258"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula>.</p></fn></p>
<p><disp-formula id="eqn-241"><label>(241)</label><mml:math id="mml-eqn-241" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:msubsup><mml:mi>D</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mover><mml:mi>V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1278"><mml:math id="mml-ieqn-1278"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the maximum number of iterations in the &#x2461; <bold>for</bold> loop in Algorithm <xref ref-type="fig" rid="fig-163">5</xref>, and <inline-formula id="ieqn-1279"><mml:math id="mml-ieqn-1279"><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:math></inline-formula> is the total number of network parameters defined in Eq. (<xref ref-type="disp-formula" rid="eqn-34">34</xref>). The factor <inline-formula id="ieqn-1280"><mml:math id="mml-ieqn-1280"><mml:msubsup><mml:mi>D</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula> is a constant, and the following bound on the component <inline-formula id="ieqn-1281"><mml:math id="mml-ieqn-1281"><mml:msubsup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is a consequence of an assumption in Theorem 4 in [<xref ref-type="bibr" rid="ref-182">182</xref>]:</p>
<p><disp-formula id="eqn-242"><label>(242)</label><mml:math id="mml-eqn-242" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:msub><mml:mi>J</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mo>&#x2225;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;for all&#x00A0;</mml:mtext></mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;for any&#x00A0;</mml:mtext></mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1282"><mml:math id="mml-ieqn-1282"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub></mml:math></inline-formula> is the infinity (max) norm, which is clearly consistent with the use of element-wise maximum components for &#x201C;long-term memory&#x201D; in Eq. (<xref ref-type="disp-formula" rid="eqn-240">240</xref>). Intuitively, <inline-formula id="ieqn-1283"><mml:math id="mml-ieqn-1283"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> has the unit of gradient squared, and thus <inline-formula id="ieqn-1284"><mml:math id="mml-ieqn-1284"><mml:msubsup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> has the unit of gradient. 
Since <inline-formula id="ieqn-1285"><mml:math id="mml-ieqn-1285"><mml:msub><mml:mi>G</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub></mml:math></inline-formula> is an upper bound on the maximum component of the gradient<xref ref-type="fn" rid="fn194"><sup>194</sup></xref><fn id="fn194"><label>194</label><p>The notation &#x201C;<inline-formula id="ieqn-3259"><mml:math id="mml-ieqn-3259"><mml:mi>G</mml:mi></mml:math></inline-formula>&#x201D; is clearly mnemonic for &#x201C;gradient&#x201D;, and the uppercase is used to designate an upper bound.</p></fn> <inline-formula id="ieqn-1286"><mml:math id="mml-ieqn-1286"><mml:mo>&#x2207;</mml:mo><mml:msub><mml:mi>J</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> of the cost function <inline-formula id="ieqn-1287"><mml:math id="mml-ieqn-1287"><mml:msub><mml:mi>J</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> at any iteration <inline-formula id="ieqn-1288"><mml:math id="mml-ieqn-1288"><mml:mi>k</mml:mi></mml:math></inline-formula>, it follows that <inline-formula id="ieqn-1289"><mml:math id="mml-ieqn-1289"><mml:msubsup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub></mml:math></inline-formula>. We refer to [<xref ref-type="bibr" rid="ref-183">183</xref>], p. 10, Lemma 4.2, for a formal proof of the inequality in Eq. (<xref ref-type="disp-formula" rid="eqn-242">242</xref>). 
After <inline-formula id="ieqn-1290"><mml:math id="mml-ieqn-1290"><mml:msubsup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is replaced by its upper bound <inline-formula id="ieqn-1291"><mml:math id="mml-ieqn-1291"><mml:msub><mml:mi>G</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub></mml:math></inline-formula>, which is then pulled out as a common factor in expression (<xref ref-type="disp-formula" rid="eqn-241">241</xref>), and after substituting <inline-formula id="ieqn-1293"><mml:math id="mml-ieqn-1293"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-1294"><mml:math id="mml-ieqn-1294"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-1295"><mml:math id="mml-ieqn-1295"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msqrt><mml:mi>k</mml:mi></mml:msqrt></mml:math></inline-formula>, we obtain:</p>
<p><disp-formula id="eqn-243"><label>(243)</label><mml:math id="mml-eqn-243" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">(</mml:mo><mml:mn>241</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>&#x2264;</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mi>D</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:msub><mml:mi>G</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msqrt><mml:mi>k</mml:mi></mml:msqrt><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mi>D</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:msub><mml:mi>G</mml:mi><mml:mi 
mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:msqrt><mml:mi>k</mml:mi></mml:msqrt><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-244"><label>(244)</label><mml:math id="mml-eqn-244" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>&#x2264;</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mi>D</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:msub><mml:mi>G</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mi>k</mml:mi><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2264;</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mi>D</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:msub><mml:mi>G</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mrow><mml:msub><mml:mstyle mathcolor="blue"><mml:mi>&#x03B2;</mml:mi></mml:mstyle><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mstyle mathcolor="purple"><mml:mi>P</mml:mi></mml:mstyle><mml:mstyle mathcolor="purple"><mml:mi>T</mml:mi></mml:mstyle></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:msub><mml:mstyle mathcolor="purple"><mml:mi>&#x03F5;</mml:mi></mml:mstyle><mml:mstyle mathcolor="purple"><mml:mn>0</mml:mn></mml:mstyle></mml:msub></mml:mrow></mml:mrow></mml:mfrac><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the following series expansion, which bounds the finite sum from above, was used:<xref ref-type="fn" rid="fn195"><sup>195</sup></xref><fn id="fn195"><label>195</label><p>See also [<xref ref-type="bibr" rid="ref-183">183</xref>], p. 4, Lemma 2.4.</p></fn></p>
<p><disp-formula id="eqn-245"><label>(245)</label><mml:math id="mml-eqn-245" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:munderover><mml:mi>k</mml:mi><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2265;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mi>k</mml:mi><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Comparing the bound on the right-hand side of (244) to the corresponding bound shown in [<xref ref-type="bibr" rid="ref-182">182</xref>], Corollary 1, second term, it can be seen that two factors, <inline-formula id="ieqn-1296"><mml:math id="mml-ieqn-1296"><mml:mrow><mml:msub><mml:mstyle mathcolor="black"><mml:mi>P</mml:mi></mml:mstyle><mml:mi>T</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-1297"><mml:math id="mml-ieqn-1297"><mml:mrow><mml:msub><mml:mstyle mathcolor="black"><mml:mi>&#x03F5;</mml:mi></mml:mstyle><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> (in purple), were missing in the numerator and in the denominator, respectively. In addition, there should be no factor <inline-formula id="ieqn-1298"><mml:math id="mml-ieqn-1298"><mml:mrow><mml:msub><mml:mstyle mathcolor="black"><mml:mi>&#x03B2;</mml:mi></mml:mstyle><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> (in blue) as pointed out in [<xref ref-type="bibr" rid="ref-183">183</xref>] in their correction of the proof in [<xref ref-type="bibr" rid="ref-182">182</xref>].</p>
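<p>The two elementary inequalities behind Eqs. (<xref ref-type="disp-formula" rid="eqn-243">243</xref>) and (<xref ref-type="disp-formula" rid="eqn-244">244</xref>), namely that the square root of <inline-formula><mml:math><mml:mi>k</mml:mi></mml:math></inline-formula> does not exceed <inline-formula><mml:math><mml:mi>k</mml:mi></mml:math></inline-formula> for <inline-formula><mml:math><mml:mi>k</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, and that the finite sum of <inline-formula><mml:math><mml:mi>k</mml:mi><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> stays below <inline-formula><mml:math><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, can be checked numerically. The Python sketch below is only an illustration; the function names are hypothetical and are not taken from [<xref ref-type="bibr" rid="ref-182">182</xref>] or [<xref ref-type="bibr" rid="ref-183">183</xref>].</p>
<code language="python"><![CDATA[
```python
import math


def partial_sums(lam, k_max):
    """Return (sum of sqrt(k) lam^(k-1), sum of k lam^(k-1)) over k = 1, ..., k_max."""
    s_sqrt = 0.0
    s_lin = 0.0
    for k in range(1, k_max + 1):
        weight = lam ** (k - 1)
        s_sqrt += math.sqrt(k) * weight   # summand appearing in Eq. (243)
        s_lin += k * weight               # summand appearing in Eq. (244)
    return s_sqrt, s_lin


def finite_sum_closed_form(lam, k_max):
    """Closed form of the sum of k lam^(k-1) over k = 1, ..., k_max (lam != 1)."""
    numerator = 1.0 - (k_max + 1) * lam**k_max + k_max * lam ** (k_max + 1)
    return numerator / (1.0 - lam) ** 2


def series_limit(lam):
    """Value of the infinite series of k lam^(k-1) for 0 < lam < 1."""
    return 1.0 / (1.0 - lam) ** 2
```
]]></code>
<p>For <inline-formula><mml:math><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>, the infinite series sums to 4, and the partial sums returned by <monospace>partial_sums</monospace> approach this value from below as the number of terms grows.</p>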
<p>On the other hand, there were some slight errors in the theorem statements and in the proofs in [<xref ref-type="bibr" rid="ref-182">182</xref>] that were corrected in [<xref ref-type="bibr" rid="ref-183">183</xref>], whose authors did a good job of not skipping mathematical details, the omission of which would have rendered the understanding and the verification of the proofs obscure and time consuming. It is thus recommended to read [<xref ref-type="bibr" rid="ref-182">182</xref>] first to get a general idea of the main convergence results of AMSGrad, and then to read [<xref ref-type="bibr" rid="ref-183">183</xref>] for the details, together with their variant of AMSGrad called <xref ref-type="sec" rid="s6_5_8">AdamX</xref>.</p>
<p>The authors of [<xref ref-type="bibr" rid="ref-203">203</xref>], like those of [<xref ref-type="bibr" rid="ref-182">182</xref>], pointed out errors in the convergence proof in [<xref ref-type="bibr" rid="ref-170">170</xref>], and proposed a fix to this proof, but did not suggest any new variant of <xref ref-type="sec" rid="s6_5_6">Adam</xref>.</p>
<p>In the two large numerical experiments on the MNIST dataset in Figure <xref ref-type="fig" rid="fig-71">71</xref>,<xref ref-type="fn" rid="fn196"><sup>196</sup></xref><fn id="fn196"><label>196</label><p>See a description of the MNIST dataset in Section <xref ref-type="sec" rid="s5_3">5.3</xref> on &#x201C;Vanishing and exploding gradients&#x201D;. For the difference between logistic regression and neural network, see, e.g., [<xref ref-type="bibr" rid="ref-208">208</xref>], Raschka, &#x201C;Machine Learning FAQ: What is the relation between Logistic Regression and Neural Networks and when to use which?&#x201D; <ext-link ext-link-type="uri" xlink:href="https://sebastianraschka.com/faq/docs/logisticregr-neuralnet.html">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://web.archive.org/web/20161204204235/https://sebastianraschka.com/faq/docs/logisticregr-neuralnet.html">Internet archive</ext-link>. See also [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 200, Figure 6.8b, for the computational graph of Logistic Regression (one-layer network).</p></fn> the authors of [<xref ref-type="bibr" rid="ref-182">182</xref>] used constant <inline-formula id="ieqn-1299"><mml:math id="mml-ieqn-1299"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0.9</mml:mn></mml:math></inline-formula>, with <inline-formula id="ieqn-1300"><mml:math id="mml-ieqn-1300"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>0.99</mml:mn><mml:mo>,</mml:mo><mml:mn>0.999</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>; they chose the step size schedule <inline-formula id="ieqn-1301"><mml:math id="mml-ieqn-1301"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msqrt><mml:mi>k</mml:mi></mml:msqrt></mml:math></inline-formula> in the logistic regression experiment, and constant step size <inline-formula id="ieqn-1302"><mml:math id="mml-ieqn-1302"><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> in a network with three layers (input, hidden, output). No single set of parameters was optimal; the best choice appeared to be problem dependent.</p>
<fig id="fig-71">
<label>Figure 71</label>
<caption><title><italic>AMSGrad vs Adam, numerical examples</italic> (Sections <xref ref-type="sec" rid="s6_1">6.1</xref>, <xref ref-type="sec" rid="s6_5_7">6.5.7</xref>). The MNIST dataset was used. The two subfigures on the left show the results of logistic regression (a network with one layer and logistic sigmoid activation function), whereas the subfigure on the right shows the results of a neural network with three layers (input layer, hidden layer, output layer). The cost function decreased faster for AMSGrad than for Adam. For logistic regression, the difference between the two cost values also decreased with the iteration number, and became very small at iteration 5000. For the three-layer neural network, the cost difference between AMSGrad and Adam stayed more or less constant as the cost decreased by more than a factor of ten from the initial cost of about 0.3; after 5000 iterations, the AMSGrad cost (<inline-formula id="ieqn-701"><mml:math id="mml-ieqn-701"><mml:mo>&#x2248;</mml:mo><mml:mn>0.01</mml:mn></mml:math></inline-formula>) was about 50% of the Adam cost (<inline-formula id="ieqn-702"><mml:math id="mml-ieqn-702"><mml:mo>&#x2248;</mml:mo><mml:mn>0.02</mml:mn></mml:math></inline-formula>). See [<xref ref-type="bibr" rid="ref-182">182</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-71.tif"/>
</fig>
<p>The authors of [<xref ref-type="bibr" rid="ref-182">182</xref>] also did not provide any numerical example with <inline-formula id="ieqn-1303"><mml:math id="mml-ieqn-1303"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>; such numerical examples can be found, however, in [<xref ref-type="bibr" rid="ref-183">183</xref>] in connection with <xref ref-type="sec" rid="s6_5_8">AdamX</xref> below.</p>
<p>Unfortunately, when comparing <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref> to <xref ref-type="sec" rid="s6_5_6">Adam</xref> and <xref ref-type="sec" rid="s6_5_10">AdamW</xref> (further below), it was remarked in [<xref ref-type="bibr" rid="ref-209">209</xref>] that <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref> generated &#x201C;a lot of noise for nothing&#x201D;, meaning AMSGrad did not live up to its potential and best-paper award when tested on &#x201C;real-life problems&#x201D;.</p>
</sec>
<sec id="s6_5_8"><label>6.5.8</label>
<title>AdamX and Nostalgic Adam</title>
<p><bold>AdamX.</bold> The authors of [<xref ref-type="bibr" rid="ref-183">183</xref>], already mentioned above in connection with errors in the proofs in [<xref ref-type="bibr" rid="ref-182">182</xref>] for <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref>, also pointed out errors in the proofs by [<xref ref-type="bibr" rid="ref-170">170</xref>] (Theorem 10.5), [<xref ref-type="bibr" rid="ref-203">203</xref>] (Theorem 4.4), and by others, and suggested a fix for these proofs, as well as a new variant of <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref> called AdamX.</p>
<p>Reference [<xref ref-type="bibr" rid="ref-183">183</xref>] is easier to read than [<xref ref-type="bibr" rid="ref-182">182</xref>], as the authors provided all mathematical details of the proofs, without skipping important steps.</p>
<p>A slight change to Eq. (<xref ref-type="disp-formula" rid="eqn-240">240</xref>) was proposed in [<xref ref-type="bibr" rid="ref-183">183</xref>] as follows:</p>
<p><disp-formula id="eqn-246"><label>(246)</label><mml:math id="mml-eqn-246" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi 
mathvariant="bold-italic">V</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mtext>&#x00A0;for&#x00A0;</mml:mtext></mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>2</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
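<p>The recursion in Eq. (<xref ref-type="disp-formula" rid="eqn-246">246</xref>) can be illustrated by the following minimal sketch, which is not the implementation of [<xref ref-type="bibr" rid="ref-183">183</xref>]. It assumes that the AMSGrad counterpart in Eq. (<xref ref-type="disp-formula" rid="eqn-240">240</xref>) is the plain element-wise running maximum, and that the first-moment parameter follows the decaying schedule with a constant base value and decay factor; the function names and the sample values are hypothetical.</p>
<code language="python"><![CDATA[
```python
def beta1_k(k, beta1=0.9, lam=0.001):
    """Decaying first-moment parameter beta_{1,k} = beta1 * lam**(k-1), k = 1, 2, ..."""
    return beta1 * lam ** (k - 1)


def vhat_amsgrad(vhat_prev, v_k):
    """Element-wise running maximum (assumed AMSGrad form of Eq. (240))."""
    return [max(a, b) for a, b in zip(vhat_prev, v_k)]


def vhat_adamx(vhat_prev, v_k, k, beta1=0.9, lam=0.001):
    """Eq. (246): rescale the previous maximum before taking the element-wise max."""
    if k == 1:
        return list(v_k)  # V-hat_1 = V_1
    ratio = (1.0 - beta1_k(k, beta1, lam)) ** 2 / (1.0 - beta1_k(k - 1, beta1, lam)) ** 2
    return [max(ratio * a, b) for a, b in zip(vhat_prev, v_k)]


# Hypothetical second-moment estimates V_1 and V_2 (two parameters).
v1, v2 = [1.0, 4.0], [2.0, 3.0]
print(vhat_amsgrad(v1, v2))              # plain running maximum
print(vhat_adamx(v1, v2, k=2))           # previous maximum scaled by the ratio in Eq. (246)
print(vhat_adamx(v1, v2, k=2, lam=1.0))  # constant beta_1: the ratio equals 1
```
]]></code>
<p>When the first-moment parameter does not increase with the iteration number, the ratio in Eq. (<xref ref-type="disp-formula" rid="eqn-246">246</xref>) is at least one, so that the AdamX maximum is never smaller than the plain running maximum; with a constant first-moment parameter, the ratio is one and the two recursions coincide in this sketch.</p>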
<p>In addition, numerical examples were provided in [<xref ref-type="bibr" rid="ref-183">183</xref>] with <inline-formula id="ieqn-1304"><mml:math id="mml-ieqn-1304"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-1305"><mml:math id="mml-ieqn-1305"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.9</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-1306"><mml:math id="mml-ieqn-1306"><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.001</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-1307"><mml:math id="mml-ieqn-1307"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.999</mml:mn></mml:math></inline-formula>, and <inline-formula id="ieqn-1308"><mml:math id="mml-ieqn-1308"><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>8</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-200">200</xref>), even though the pseudocode did not use <inline-formula id="ieqn-1309"><mml:math id="mml-ieqn-1309"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> (or set <inline-formula id="ieqn-1310"><mml:math id="mml-ieqn-1310"><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>). The authors of [<xref ref-type="bibr" rid="ref-183">183</xref>] showed that both <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref> and AdamX converged with similar results, thus supporting their theoretical investigation, in particular, correcting the errors in the proofs of [<xref ref-type="bibr" rid="ref-182">182</xref>].</p>
<p><bold>Nostalgic Adam.</bold> The authors of [<xref ref-type="bibr" rid="ref-204">204</xref>] also fixed the non-convergence of Adam by introducing &#x201C;long-term memory&#x201D; to the second-moment of the gradient estimates, similar to the work in [<xref ref-type="bibr" rid="ref-182">182</xref>] on <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref> and in [<xref ref-type="bibr" rid="ref-183">183</xref>] on <xref ref-type="sec" rid="s6_5_8">AdamX</xref>.</p>
<p>There are many more variants of <xref ref-type="sec" rid="s6_5_6">Adam</xref>. But how do <xref ref-type="sec" rid="s6_5_6">Adam</xref> and its variants compare to good old SGD with new <xref ref-type="sec" rid="s6_3_1">add-on tricks</xref>? (See the end of Section <xref ref-type="sec" rid="s6_3_1">6.3.1</xref>.)</p>
<fig id="fig-72">
<label>Figure 72</label>
<caption><title><italic>Overfitting</italic> (Sections <xref ref-type="sec" rid="s6_5_9">6.5.9</xref>, <xref ref-type="sec" rid="s6_5_10">6.5.10</xref>). <italic>Left:</italic> Underfitting with 1st-order polynomial. <italic>Middle:</italic> Appropriate fitting with 2nd-order polynomial. <italic>Right:</italic> Overfitting with 9th-order polynomial. See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 110, Figure 5.2. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-72.tif"/></fig>
</sec>
<sec id="s6_5_9"><label>6.5.9</label>
<title>Criticism of adaptive methods, resurgence of SGD</title>
<p>Yet, despite the claim that <xref ref-type="sec" rid="s6_5_4">RMSProp</xref> is &#x201C;currently one of the go-to optimization methods being employed routinely by deep learning practitioners,&#x201D; and that &#x201C;currently, the most popular optimization algorithms actively in use include <xref ref-type="sec" rid="s6_3_2">SGD</xref>, <xref ref-type="sec" rid="s6_3_2">SGD with momentum</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, RMSProp with momentum, <xref ref-type="sec" rid="s6_5_5">AdaDelta</xref>, and <xref ref-type="sec" rid="s6_5_6">Adam</xref>&#x201D;,<xref ref-type="fn" rid="fn197"><sup>197</sup></xref><fn id="fn197"><label>197</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], pp. 301-302.</p></fn> the authors of [<xref ref-type="bibr" rid="ref-55">55</xref>] showed, through their numerical experiments, that adaptivity can overfit (Figure <xref ref-type="fig" rid="fig-72">72</xref>), and that standard SGD with step-size tuning performed better than adaptive learning-rate algorithms such as <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, and <xref ref-type="sec" rid="s6_5_6">Adam</xref>. The total number of parameters, <inline-formula id="ieqn-1311"><mml:math id="mml-ieqn-1311"><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-34">34</xref>), in deep networks could easily exceed 25 times the number of output targets <inline-formula id="ieqn-1312"><mml:math id="mml-ieqn-1312"><mml:mi>m</mml:mi></mml:math></inline-formula> (Figure <xref ref-type="fig" rid="fig-23">23</xref>), i.e.,<xref ref-type="fn" rid="fn198"><sup>198</sup></xref><fn id="fn198"><label>198</label><p>See [<xref ref-type="bibr" rid="ref-55">55</xref>], p. 4, Section 3.3 &#x201C;Adaptivity can overfit&#x201D;.</p></fn></p>
<p><disp-formula id="eqn-247"><label>(247)</label><mml:math id="mml-eqn-247" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msub><mml:mo>&#x2265;</mml:mo><mml:mn>25</mml:mn><mml:mi>m</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>making such networks prone to overfitting unless special techniques such as regularization or weight decay are employed (see <xref ref-type="sec" rid="s6_5_10">AdamW</xref> below).</p>
<p>It was observed in [<xref ref-type="bibr" rid="ref-55">55</xref>] that adaptive methods tended to have larger generalization (test) errors<xref ref-type="fn" rid="fn199"><sup>199</sup></xref><fn id="fn199"><label>199</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 107, regarding training error and test (generalization) error. &#x201C;The ability to perform well on previously unobserved inputs is called generalization.&#x201D;</p></fn> compared to SGD: &#x201C;We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance,&#x201D; (see Figure <xref ref-type="fig" rid="fig-73">73</xref>), and concluded that:</p>
<disp-quote><p>&#x201C;Despite the fact that our experimental evidence demonstrates that adaptive methods are not advantageous for machine learning, the <xref ref-type="sec" rid="s6_5_6">Adam</xref> algorithm remains incredibly popular. We are not sure exactly as to why, but hope that our step-size tuning suggestions make it easier for practitioners to use standard stochastic gradient methods in their research.&#x201D;</p>
</disp-quote><p>The work of [<xref ref-type="bibr" rid="ref-55">55</xref>] has encouraged researchers who were enthusiastic about adaptive methods to take a fresh look at SGD, and to tease something more out of this classic method.<xref ref-type="fn" rid="fn200"><sup>200</sup></xref><fn id="fn200"><label>200</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-210">210</xref>], where [<xref ref-type="bibr" rid="ref-55">55</xref>] was not referred to directly, but through a reference to [<xref ref-type="bibr" rid="ref-164">164</xref>], in which there was a reference to [<xref ref-type="bibr" rid="ref-55">55</xref>].</p></fn></p>
<fig id="fig-73">
<label>Figure 73</label>
<caption><title><italic><xref ref-type="sec" rid="s6_3_1">Standard SGD</xref> and <xref ref-type="sec" rid="s6_3_2">SGD with momentum</xref> vs. <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref> on CIFAR-10 dataset</italic> (Sections <xref ref-type="sec" rid="s6_1">6.1</xref>, <xref ref-type="sec" rid="s6_3_2">6.3.2</xref>, <xref ref-type="sec" rid="s6_5_9">6.5.9</xref>). From [<xref ref-type="bibr" rid="ref-55">55</xref>], where a method for step-size tuning and step-size decaying was proposed to achieve lowest training error and generalization (test) error for both <xref ref-type="sec" rid="s6_3_1">Standard SGD</xref> and <xref ref-type="sec" rid="s6_3_2">SGD with momentum</xref> (&#x201C;Heavy Ball&#x201D; or better yet &#x201C;<xref ref-type="sec" rid="s6_3_2">Small Heavy Sphere</xref>&#x201D; method) compared to adaptive methods such as <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, <xref ref-type="sec" rid="s6_5_4">RMSProp</xref>, <xref ref-type="sec" rid="s6_5_6">Adam</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-73.tif"/>
</fig>
</sec>
<sec id="s6_5_10"><label>6.5.10</label>
<title>AdamW: Adaptive moment with weight decay</title>
<p>The authors of [<xref ref-type="bibr" rid="ref-56">56</xref>], aware of the work in [<xref ref-type="bibr" rid="ref-55">55</xref>], noted that it was suggested in [<xref ref-type="bibr" rid="ref-55">55</xref>] &#x201C;that adaptive gradient methods do not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks, such as image classification, character-level language modeling and constituency parsing.&#x201D; In particular, the authors of [<xref ref-type="bibr" rid="ref-56">56</xref>] showed that &#x201C;a major factor of the poor generalization of the most popular adaptive gradient method, <xref ref-type="sec" rid="s6_5_6">Adam</xref>, is due to the fact that <inline-formula id="ieqn-1313"><mml:math id="mml-ieqn-1313"><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> regularization is not nearly as effective for it as for SGD,&#x201D; and proposed to move the weight decay from the gradient (<inline-formula id="ieqn-1314"><mml:math id="mml-ieqn-1314"><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> regularization) to the parameter update (original weight decay regularization). So what is &#x201C;<inline-formula id="ieqn-1315"><mml:math id="mml-ieqn-1315"><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> regularization&#x201D; and what is &#x201C;weight decay&#x201D;? (See also Section <xref ref-type="sec" rid="s6_3_6">6.3.6</xref> on weight decay.)</p>
<p>Briefly, &#x201C;<inline-formula id="ieqn-1316"><mml:math id="mml-ieqn-1316"><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> regularization&#x201D; aims at decreasing the overall weights to avoid overfitting, which simply means that the network model tries to fit through as many data points as possible, including noise.<xref ref-type="fn" rid="fn201"><sup>201</sup></xref><fn id="fn201"><label>201</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 107, Section <xref ref-type="sec" rid="s5_2">5.2</xref> on &#x201C;Capacity, overfitting and underfitting&#x201D;, and p. 115, which provides a good explanation and motivation for regularization; see also Gupta 2017, &#x2018;Deep Learning: Overfitting&#x2019;, 2017.02.12, <ext-link ext-link-type="uri" xlink:href="https://towardsdatascience.com/deep-learning-overfitting-846bf5b35e24">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20190201174735/https://towardsdatascience.com/deep-learning-overfitting-846bf5b35e24?gi=f70587e2080">Internet archive</ext-link>.</p></fn> To this end, a penalty proportional to the squared Euclidean norm of the parameters is added to the cost function:</p>
<p>
<disp-formula id="eqn-248"><label>(248)</label><mml:math id="mml-eqn-248" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:msub><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The magnitude of the coefficient <inline-formula id="ieqn-1317"><mml:math id="mml-ieqn-1317"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow></mml:math></inline-formula> regulates (or regularizes) the behavior of the network: <inline-formula id="ieqn-1318"><mml:math id="mml-ieqn-1318"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> would lead to overfitting (Figure <xref ref-type="fig" rid="fig-72">72</xref> right), a moderate <inline-formula id="ieqn-1319"><mml:math id="mml-ieqn-1319"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow></mml:math></inline-formula> may yield appropriate fitting (Figure <xref ref-type="fig" rid="fig-72">72</xref> middle), a large <inline-formula id="ieqn-1320"><mml:math id="mml-ieqn-1320"><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow></mml:math></inline-formula> may lead to underfitting (Figure <xref ref-type="fig" rid="fig-72">72</xref> left). The gradient of the regularized cost <inline-formula id="ieqn-1321"><mml:math id="mml-ieqn-1321"><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is then</p>
<p><disp-formula id="eqn-249"><label>(249)</label><mml:math id="mml-eqn-249" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>:=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and the update becomes:</p>
<p><disp-formula id="eqn-250"><label>(250)</label><mml:math id="mml-eqn-250" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" 
/><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-74">
<label>Figure 74</label>
<caption><title><italic>AdamW vs Adam, SGD, and variants on CIFAR-10 dataset</italic> (Sections <xref ref-type="sec" rid="s6_1">6.1</xref>, <xref ref-type="sec" rid="s6_5_10">6.5.10</xref>). While AdamW achieved the lowest training loss (error) after 1800 epochs, the results showed that SGD with weight decay (SGDW) and with warm restart (SGDWR) achieved lower test (generalization) errors than <xref ref-type="sec" rid="s6_5_6">Adam</xref>, <xref ref-type="sec" rid="s6_5_10">AdamW</xref>, and AdamWR. See Figure <xref ref-type="fig" rid="fig-75">75</xref> for the scheduling of the annealing multiplier <inline-formula id="ieqn-703"><mml:math id="mml-ieqn-703"><mml:msub><mml:mrow><mml:mi>&#x1D51E;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, for which the epoch numbers (100, 300, 700, 1500) for complete cooling (<inline-formula id="ieqn-704"><mml:math id="mml-ieqn-704"><mml:msub><mml:mrow><mml:mi>&#x1D51E;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>) coincided with the epoch numbers of the sharp minima. There was, however, a diminishing return beyond the 4th cycle as indicated by the dotted arrows, for both the training error and the test error, which actually increased at the end of the 4th cycle (right subfigure, red arrow); see Section <xref ref-type="sec" rid="s6_1">6.1</xref> on early-stopping criteria and <xref ref-type="statement" rid="st6_14">Remark 6.14</xref>. Adapted from [<xref ref-type="bibr" rid="ref-56">56</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-74.tif"/>
</fig>
<p>which is equivalent to decaying the parameters (including the weights in) <inline-formula id="ieqn-1322"><mml:math id="mml-ieqn-1322"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> when <inline-formula id="ieqn-1323"><mml:math id="mml-ieqn-1323"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, but with varying decay parameter <inline-formula id="ieqn-1324"><mml:math id="mml-ieqn-1324"><mml:msub><mml:mrow><mml:mi>&#x1D521;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> depending on the step length <inline-formula id="ieqn-1325"><mml:math id="mml-ieqn-1325"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, which itself would decrease toward zero.</p>
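<p>The two forms of the update in Eq. (<xref ref-type="disp-formula" rid="eqn-250">250</xref>) can be checked numerically with the following minimal sketch for plain SGD; the function names and the sample values of the parameters, gradient, step length, and decay coefficient are hypothetical.</p>
<code language="python"><![CDATA[
```python
def sgd_l2_step(theta, grad, eps, d):
    """First form of Eq. (250): gradient step on the L2-regularized cost."""
    return [t - eps * (g + d * t) for t, g in zip(theta, grad)]


def sgd_weight_decay_step(theta, grad, eps, d):
    """Second form of Eq. (250): shrink parameters by (1 - eps*d), then step along -grad."""
    return [(1.0 - eps * d) * t - eps * g for t, g in zip(theta, grad)]


theta = [1.0, -2.0, 0.5]   # hypothetical parameters
grad = [0.3, -0.1, 0.2]    # hypothetical gradient of the unregularized cost
eps, d = 0.1, 0.01         # step length and decay coefficient (assumed values)

theta_l2 = sgd_l2_step(theta, grad, eps, d)
theta_wd = sgd_weight_decay_step(theta, grad, eps, d)

# Both forms agree up to floating-point round-off.
print(max(abs(a - b) for a, b in zip(theta_l2, theta_wd)))
```
]]></code>
<p>With a zero gradient, each parameter in this sketch is simply multiplied by one minus the product of the step length and the decay coefficient, which is the decay referred to above.</p>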
<p>The equivalence between <inline-formula id="ieqn-1326"><mml:math id="mml-ieqn-1326"><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> regularization Eq. (<xref ref-type="disp-formula" rid="eqn-248">248</xref>) and weight decay Eq. (<xref ref-type="disp-formula" rid="eqn-250">250</xref>) relies on the update being linear with respect to the gradient, and does not carry over to adaptive methods, due to the nonlinearity with respect to the gradient in the update procedure using Eq. (<xref ref-type="disp-formula" rid="eqn-200">200</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-201">201</xref>). See also the parameter update in lines 12-13 of the unified pseudocode for adaptive methods in Algorithm <xref ref-type="fig" rid="fig-163">5</xref> and Footnote <xref ref-type="fn" rid="fn176">176</xref>. Thus, it was proposed in [<xref ref-type="bibr" rid="ref-56">56</xref>] to explicitly add weight decay to the parameter update Eq. (<xref ref-type="disp-formula" rid="eqn-201">201</xref>) for AdamW (lines <xref ref-type="fig" rid="fig-163">12</xref>-<xref ref-type="fig" rid="fig-163">13</xref> in Algorithm <xref ref-type="fig" rid="fig-163">5</xref>) as follows:</p>
<p><disp-formula id="eqn-251"><label>(251)</label><mml:math id="mml-eqn-251" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="fraktur">a</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">d</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#x00A0;(element-wise operations)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the parameter <inline-formula id="ieqn-1327"><mml:math id="mml-ieqn-1327"><mml:msub><mml:mrow><mml:mi>&#x1D51E;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> is an annealing multiplier defined in Section <xref ref-type="sec" rid="s6_3_4">6.3.4</xref> on cosine annealing:</p>
<p><disp-formula id="eqn-154a"><mml:math id="mml-eqn-154a" display="block"><mml:mrow><mml:msub><mml:mi mathvariant='fraktur'>a</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn><mml:mo>+</mml:mo><mml:mn>0.5</mml:mn><mml:mi>cos</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x03C0;</mml:mi><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>]</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>:=</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>q</mml:mi></mml:msub></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>.</mml:mo></mml:math></disp-formula></p>
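<p>The difference between folding the decay into the gradient, Eq. (<xref ref-type="disp-formula" rid="eqn-249">249</xref>), and decoupling it from the gradient, Eq. (<xref ref-type="disp-formula" rid="eqn-251">251</xref>), can be illustrated by the following minimal sketch, which is not the implementation of [<xref ref-type="bibr" rid="ref-56">56</xref>]. It assumes the standard Adam moment estimates with bias correction for Eqs. (<xref ref-type="disp-formula" rid="eqn-200">200</xref>)-(<xref ref-type="disp-formula" rid="eqn-201">201</xref>), and merges the element-wise step length and the descent direction of Eq. (<xref ref-type="disp-formula" rid="eqn-251">251</xref>) into a scalar step length times a normalized direction; all function names and sample values are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def adam_direction(grad, m, v, k, beta1=0.9, beta2=0.999, delta=1e-8):
    """Standard Adam moments with bias correction (assumed form); returns (direction, m, v)."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** k) for mi in m]
    v_hat = [vi / (1 - beta2 ** k) for vi in v]
    direction = [-mh / (math.sqrt(vh) + delta) for mh, vh in zip(m_hat, v_hat)]
    return direction, m, v


def adam_l2_step(theta, grad, m, v, k, eps, d):
    """Adam with L2 regularization: the decay term d*theta is folded into the gradient."""
    grad_reg = [g + d * t for g, t in zip(grad, theta)]
    direction, m, v = adam_direction(grad_reg, m, v, k)
    return [t + eps * di for t, di in zip(theta, direction)], m, v


def adamw_step(theta, grad, m, v, k, eps, d, a_k=1.0):
    """Decoupled weight decay in the form of Eq. (251): theta + a_k*(eps*direction - d*theta)."""
    direction, m, v = adam_direction(grad, m, v, k)
    return [t + a_k * (eps * di - d * t) for t, di in zip(theta, direction)], m, v


theta = [10.0, 0.1]   # one large and one small parameter (hypothetical)
zero = [0.0, 0.0]     # zero gradient and zero initial moments isolate the decay effect
th_l2, _, _ = adam_l2_step(theta, zero, zero, zero, k=1, eps=0.01, d=0.05)
th_w, _, _ = adamw_step(theta, zero, zero, zero, k=1, eps=0.01, d=0.05)
print(th_l2)  # both parameters move by about eps, regardless of their magnitude
print(th_w)   # both parameters shrink in proportion to their magnitude
```
]]></code>
<p>In this sketch, the normalization by the square root of the second moment removes the dependence of the folded-in decay on the parameter magnitude, whereas the decoupled form shrinks each parameter proportionally, as weight decay does for SGD in Eq. (<xref ref-type="disp-formula" rid="eqn-250">250</xref>).</p>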
<p>The results of the numerical experiments on the CIFAR-10 dataset using <xref ref-type="sec" rid="s6_5_6">Adam</xref>, AdamW, SGDW (Weight decay), AdamWR (Warm Restart), and SGDWR were reported in Figure <xref ref-type="fig" rid="fig-74">74</xref> [<xref ref-type="bibr" rid="ref-56">56</xref>].</p>
<statement id="st6_14"><title><xref ref-type="statement" rid="st6_14">Remark 6.14</xref>.</title>
<p>Limitation of cyclic annealing. The 5th annealing cycle, not shown in Figure <xref ref-type="fig" rid="fig-74">74</xref>, would end at epoch <inline-formula id="ieqn-1328"><mml:math id="mml-ieqn-1328"><mml:mn>1500</mml:mn><mml:mo>+</mml:mo><mml:mn>1600</mml:mn><mml:mo>=</mml:mo><mml:mn>3100</mml:mn></mml:math></inline-formula>, which is well beyond the epoch budget of <inline-formula id="ieqn-1329"><mml:math id="mml-ieqn-1329"><mml:mn>1800</mml:mn></mml:math></inline-formula>. Figure <xref ref-type="fig" rid="fig-74">74</xref> shows a diminishing return in the decrease of the training error at the end of each cycle, together with an increase in the test error by the end of the 4th cycle. It is therefore unlikely that starting the 5th cycle would be worthwhile: The computation would be more expensive due to the warm restart, and the increase in the test error indicated that the end of the 3rd cycle was optimal, hence the reason for [<xref ref-type="bibr" rid="ref-56">56</xref>] to stop at the end of the 4th cycle.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
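<p>The cycle arithmetic in <xref ref-type="statement" rid="st6_14">Remark 6.14</xref> can be checked with a minimal Python sketch of the cosine annealing multiplier with warm restarts. The sketch is ours, not code from [<xref ref-type="bibr" rid="ref-56">56</xref>], and rests on stated assumptions: the index <italic>j</italic> is taken to be the current epoch number, the cycle durations double starting from 100 epochs as in Figure <xref ref-type="fig" rid="fig-75">75</xref>, and a warm restart is taken to occur at the first epoch after a cycle is completed. The function names and arguments are hypothetical.</p>
<p><code language="python"><![CDATA[
```python
import math


def annealing_multiplier(epoch, first_period=100, growth=2):
    """Return (a_k, p, T_cur) for cosine annealing with warm restarts.

    Assumptions (ours): `epoch` plays the role of j, and cycle p lasts
    T_p = first_period * growth**(p - 1) epochs.
    """
    p, start, period = 1, 0, first_period
    # Skip every completed cycle to find the cycle p containing this epoch.
    while epoch - start >= period:
        start += period          # start = sum of T_q for q = 1 .. p-1
        period *= growth         # duration T_p of the next cycle
        p += 1
    t_cur = epoch - start        # T_cur = j - sum_{q=1}^{p-1} T_q
    a_k = 0.5 + 0.5 * math.cos(math.pi * t_cur / period)
    return a_k, p, t_cur


def cycle_ends(n_cycles, first_period=100, growth=2):
    """Return the epoch at which each of the first n_cycles cycles ends."""
    ends, total, period = [], 0, first_period
    for _ in range(n_cycles):
        total += period
        ends.append(total)
        period *= growth
    return ends
```
]]></code></p>
<p>Under these assumptions, <monospace>cycle_ends(5)</monospace> returns the epochs 100, 300, 700, 1500, and 3100, so that the 5th cycle would indeed end far beyond an epoch budget of 1800, while the multiplier returns to 1 at each warm restart and decays toward 0 at the end of each cycle.</p>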
<fig id="fig-75">
<label>Figure 75</label>
<caption><title><italic>Cosine annealing</italic> (Sections <xref ref-type="sec" rid="s6_3_4">6.3.4</xref>, <xref ref-type="sec" rid="s6_5_10">6.5.10</xref>). Annealing factor <inline-formula id="ieqn-705"><mml:math id="mml-ieqn-705"><mml:msub><mml:mrow><mml:mi>&#x1D51E;</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> as a function of epoch number. Four annealing cycles <inline-formula id="ieqn-706"><mml:math id="mml-ieqn-706"><mml:mi>p</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula>, with the following schedule for <inline-formula id="ieqn-707"><mml:math id="mml-ieqn-707"><mml:msub><mml:mi>T</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-154">154</xref>): (1) Cycle 1, <inline-formula id="ieqn-708"><mml:math id="mml-ieqn-708"><mml:msub><mml:mi>T</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>100</mml:mn></mml:math></inline-formula> epochs, epoch 0 to epoch 100, (2) Cycle 2, <inline-formula id="ieqn-709"><mml:math id="mml-ieqn-709"><mml:msub><mml:mi>T</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>200</mml:mn></mml:math></inline-formula> epochs, epoch 101 to epoch 300, (3) Cycle 3, <inline-formula id="ieqn-710"><mml:math id="mml-ieqn-710"><mml:msub><mml:mi>T</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>400</mml:mn></mml:math></inline-formula> epochs, epoch 301 to epoch 700, (4) Cycle 4, <inline-formula id="ieqn-711"><mml:math id="mml-ieqn-711"><mml:msub><mml:mi>T</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>800</mml:mn></mml:math></inline-formula> epochs, epoch 701 to epoch 1500. From [<xref ref-type="bibr" rid="ref-56">56</xref>]. See Figure <xref ref-type="fig" rid="fig-74">74</xref> in which the curves for AdamWR and SGDWR ended at epoch 1500. 
(Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-75.tif"/>
</fig>
<p>The results for test errors in Figure <xref ref-type="fig" rid="fig-74">74</xref> appeared to confirm the <xref ref-type="sec" rid="s6_5_9">criticism</xref> in [<xref ref-type="bibr" rid="ref-55">55</xref>] that adaptive methods brought about &#x201C;marginal value&#x201D; compared to the classic <xref ref-type="sec" rid="s6_3_1">SGD</xref>. This observation was also in agreement with [<xref ref-type="bibr" rid="ref-168">168</xref>], where it was stated:</p>
<disp-quote><p>&#x201C;In our experiments, either AdaBayes or AdaBayes-SS outperformed other adaptive methods, including AdamW (Loshchilov &amp; Hutter, 2017), and Ada/AMSBound (Luo et al., 2019), though SGD frequently outperformed all adaptive methods.&#x201D; (See Figure <xref ref-type="fig" rid="fig-76">76</xref> and Figure <xref ref-type="fig" rid="fig-77">77</xref>)</p>
</disp-quote><p>If <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref> generated &#x201C;a lot of noise for nothing&#x201D; compared to <xref ref-type="sec" rid="s6_5_6">Adam</xref> and <xref ref-type="sec" rid="s6_5_10">AdamW</xref>, according to [<xref ref-type="bibr" rid="ref-209">209</xref>], then does &#x201C;marginal value&#x201D; mean that adaptive methods in general generated a lot of noise for not much, compared to <xref ref-type="sec" rid="s6_3_1">SGD</xref>?</p>
<p>The work in [<xref ref-type="bibr" rid="ref-55">55</xref>], [<xref ref-type="bibr" rid="ref-168">168</xref>], and [<xref ref-type="bibr" rid="ref-56">56</xref>] proved once more that a classic like SGD, introduced by Robbins &amp; Monro (1951b) [<xref ref-type="bibr" rid="ref-49">49</xref>], never dies, and provides motivation to generalize classical deterministic first- and second-order optimization methods, together with line-search methods, by adding stochasticity. We will review in detail two papers along this line: [<xref ref-type="bibr" rid="ref-144">144</xref>] and [<xref ref-type="bibr" rid="ref-145">145</xref>].</p>
</sec> </sec>
<sec id="s6_6"><label>6.6</label>
<title>SGD with Armijo line search and adaptive minibatch</title>
<p>In parallel to the deterministic choice of step length based on Armijo&#x2019;s rule in Eq. (<xref ref-type="disp-formula" rid="eqn-125">125</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-126">126</xref>), we have the following respective stochastic versions proposed in [<xref ref-type="bibr" rid="ref-144">144</xref>]:<xref ref-type="fn" rid="fn202"><sup>202</sup></xref><fn id="fn202"><label>202</label><p>It is not until Section 4.7 in [<xref ref-type="bibr" rid="ref-144">144</xref>] that this version is presented for general descent in the nonconvex case, whereas the pseudocode in their Algorithm 1, given at the beginning of their paper and referred to in Section 4.5, was restricted to steepest descent in the convex case.</p></fn></p>
<p><disp-formula id="eqn-252"><label>(252)</label><mml:math id="mml-eqn-252" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>&#x03F5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>j</mml:mi></mml:msup><mml:mspace width="thinmathspace" /><mml:mi>&#x03C1;</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>j</mml:mi></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>j</mml:mi></mml:msup></mml:mrow><mml:mi>&#x03C1;</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" 
scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-253"><label>(253)</label><mml:math id="mml-eqn-253" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mi>&#x03F5;</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-76">
<label>Figure 76</label>
<caption><title><italic>CIFAR-100 test loss using ResNet-34 and DenseNet-121</italic> (Section <xref ref-type="sec" rid="s6_5_10">6.5.10</xref>). Comparison between various optimizers, including <xref ref-type="sec" rid="s6_5_6">Adam</xref> and <xref ref-type="sec" rid="s6_5_10">AdamW</xref>, showing that <xref ref-type="sec" rid="s6_3_1">SGD</xref> achieved the lowest global minimum loss (blue line) compared to all adaptive methods tested, as shown in [<xref ref-type="bibr" rid="ref-168">168</xref>]. See also Figure <xref ref-type="fig" rid="fig-77">77</xref> and Section <xref ref-type="sec" rid="s6_1">6.1</xref> on early-stopping criteria. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-76.tif"/>
</fig>
<fig id="fig-77">
<label>Figure 77</label>
<caption><title><italic>SGD frequently outperformed all adaptive methods</italic> (Section <xref ref-type="sec" rid="s6_5_10">6.5.10</xref>). The table contains the global minimum for each optimizer, for each of the two datasets CIFAR-10 and CIFAR-100, using two different networks. For each network, an error percentage and the loss (cost) were given. Shown in red are the lowest global minima obtained by <xref ref-type="sec" rid="s6_3_1">SGD</xref> in the corresponding columns. Even in the three columns in which <xref ref-type="sec" rid="s6_3_1">SGD</xref> results were not the lowest, two <xref ref-type="sec" rid="s6_3_1">SGD</xref> results were just slightly above those of <xref ref-type="sec" rid="s6_5_10">AdamW</xref> (1st and 3rd columns), and one was even smaller (4th column). <xref ref-type="sec" rid="s6_3_1">SGD</xref> clearly beat <xref ref-type="sec" rid="s6_5_6">Adam</xref>, <xref ref-type="sec" rid="s6_5_2">AdaGrad</xref>, and <xref ref-type="sec" rid="s6_5_7">AMSGrad</xref> [<xref ref-type="bibr" rid="ref-168">168</xref>]. See also Figure <xref ref-type="fig" rid="fig-76">76</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-77.tif"/>
</fig>
<fig id="fig-164">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-164.tif"/>
</fig>
<p>where the overhead tilde of a quantity designates an estimate of that quantity based on a randomly selected minibatch, i.e., <inline-formula id="ieqn-1337"><mml:math id="mml-ieqn-1337"><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is the cost estimate, <inline-formula id="ieqn-1338"><mml:math id="mml-ieqn-1338"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> the descent-direction estimate, and <inline-formula id="ieqn-1339"><mml:math id="mml-ieqn-1339"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> the gradient estimate, similar to those in Algorithm <xref ref-type="fig" rid="fig-162">4</xref>.</p>
<p>There is a difference, though: The standard SGD in Algorithm <xref ref-type="fig" rid="fig-162">4</xref> uses a minibatch of fixed size for the computation of the cost estimate and the gradient estimate, whereas Algorithm <xref ref-type="fig" rid="fig-164">6</xref> in [<xref ref-type="bibr" rid="ref-144">144</xref>] uses adaptive subprocedures that adjust the size of the minibatches to achieve a desired (fixed) probability <inline-formula id="ieqn-1340"><mml:math id="mml-ieqn-1340"><mml:msub><mml:mi>p</mml:mi><mml:mi>J</mml:mi></mml:msub></mml:math></inline-formula> that the cost estimate is close to the true cost, and a desired probability <inline-formula id="ieqn-1341"><mml:math id="mml-ieqn-1341"><mml:msub><mml:mi>p</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula> that the gradient estimate is close to the true gradient. These adaptive-minibatch subprocedures are also functions of the learning rate (step length) <inline-formula id="ieqn-1342"><mml:math id="mml-ieqn-1342"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula>, conceptually written as:</p>
<p><disp-formula id="eqn-254"><label>(254)</label><mml:math id="mml-eqn-254" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>J</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which are the counterparts to the fixed-minibatch procedures in Eq. (<xref ref-type="disp-formula" rid="eqn-139">139</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-140">140</xref>), respectively.</p>
<statement id="st6_15"><title><xref ref-type="statement" rid="st6_15">Remark 6.15</xref>.</title>
<p>Since the appropriate size of the minibatch depends on the gradient estimate, which is not known and which is computed based on the minibatch itself, the adaptive-minibatch subprocedures for cost estimate <inline-formula id="ieqn-1343"><mml:math id="mml-ieqn-1343"><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> and for gradient estimate <inline-formula id="ieqn-1344"><mml:math id="mml-ieqn-1344"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-254">254</xref>) contain a loop, started by guessing the gradient estimate, to gradually increase the size of the minibatch by adding more samples until certain criteria are met.<xref ref-type="fn" rid="fn203"><sup>203</sup></xref><fn id="fn203"><label>203</label><p>See [<xref ref-type="bibr" rid="ref-144">144</xref>], p. 7, below Eq. (2.4).</p></fn></p>
<p>In addition, since both the cost estimate <inline-formula id="ieqn-1345"><mml:math id="mml-ieqn-1345"><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> and the gradient estimate <inline-formula id="ieqn-1346"><mml:math id="mml-ieqn-1346"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> depend on the step length <inline-formula id="ieqn-1347"><mml:math id="mml-ieqn-1347"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula>, the Armijo line-search loop to determine the step length <inline-formula id="ieqn-1348"><mml:math id="mml-ieqn-1348"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> (denoted as the &#x2461; <bold>for</bold> loop in the deterministic Algorithm <xref ref-type="fig" rid="fig-160">2</xref>) is combined with the iteration loop <inline-formula id="ieqn-1349"><mml:math id="mml-ieqn-1349"><mml:mi>k</mml:mi></mml:math></inline-formula> in Algorithm <xref ref-type="fig" rid="fig-164">6</xref>, where these two combined loops are denoted as the &#x2461;&#x2462; <bold>for</bold> loop.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
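<p>To make the idea of an adaptive-minibatch loop concrete, the following Python sketch grows a random minibatch until an accuracy test on the gradient estimate is passed. It illustrates only the loop structure described in <xref ref-type="statement" rid="st6_15">Remark 6.15</xref>: The stopping test used here (standard error of the mean gradient not larger than a constant times the step length times the norm of the gradient estimate) is a hypothetical stand-in of ours, and is not the criterion of [<xref ref-type="bibr" rid="ref-144">144</xref>]. The function name, the constant <monospace>kappa</monospace>, the initial size <monospace>batch0</monospace>, and the doubling rule are likewise assumptions.</p>
<p><code language="python"><![CDATA[
```python
import numpy as np


def adaptive_gradient_estimate(per_sample_grad, n_data, step, kappa=1.0,
                               batch0=8, growth=2, seed=0):
    """Return (gradient estimate, minibatch size) from a growing minibatch.

    per_sample_grad(i) returns the gradient contribution of sample i as a
    1-D array. All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_data)      # random order without replacement
    size = min(batch0, n_data)
    while True:
        grads = np.stack([per_sample_grad(i) for i in order[:size]])
        g_est = grads.mean(axis=0)       # minibatch gradient estimate
        # Standard error of the mean gradient (needs at least 2 samples).
        std_err = np.linalg.norm(grads.std(axis=0, ddof=1)) / np.sqrt(size)
        # HYPOTHETICAL accuracy test, not the criterion used in [144]:
        # a smaller step length demands a more accurate gradient estimate.
        if std_err <= kappa * step * np.linalg.norm(g_est) or size == n_data:
            return g_est, size
        size = min(growth * size, n_data)  # add more samples and retry
```
]]></code></p>
<p>In this sketch, the test depends on the gradient estimate itself, which is why a loop is needed, and a smaller step length leads to a larger minibatch, in line with the dependence on the step length written conceptually in Eq. (<xref ref-type="disp-formula" rid="eqn-254">254</xref>).</p>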
<p>The same relationship between <inline-formula id="ieqn-1350"><mml:math id="mml-ieqn-1350"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-1351"><mml:math id="mml-ieqn-1351"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> as in Eq. (<xref ref-type="disp-formula" rid="eqn-121">121</xref>) holds:</p>
<p><disp-formula id="eqn-255"><label>(255)</label><mml:math id="mml-eqn-255" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:msub><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi>d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>For SGD, the descent-direction estimate <inline-formula id="ieqn-1352"><mml:math id="mml-ieqn-1352"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is identified with the steepest-descent direction estimate <inline-formula id="ieqn-1353"><mml:math id="mml-ieqn-1353"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-256"><label>(256)</label><mml:math id="mml-eqn-256" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>For Newton-type algorithms, such as in [<xref ref-type="bibr" rid="ref-145">145</xref>] [<xref ref-type="bibr" rid="ref-146">146</xref>], the descent-direction estimate <inline-formula id="ieqn-1354"><mml:math id="mml-ieqn-1354"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is set equal to the Hessian estimate <inline-formula id="ieqn-1355"><mml:math id="mml-ieqn-1355"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> multiplied by the steepest-descent direction estimate <inline-formula id="ieqn-1356"><mml:math id="mml-ieqn-1356"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-257"><label>(257)</label><mml:math id="mml-eqn-257" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
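<p>As an illustration of the stochastic Armijo rule in Eq. (<xref ref-type="disp-formula" rid="eqn-252">252</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-253">253</xref>), the Python sketch below backtracks on the step length for a given minibatch cost estimate, gradient estimate, and descent-direction estimate. It is a simplified sketch of ours under stated assumptions, not Algorithm <xref ref-type="fig" rid="fig-164">6</xref> of [<xref ref-type="bibr" rid="ref-144">144</xref>]: The minibatch is held fixed during the search, there is no adaptive sampling and no reliability parameter, and the search tries the powers 0, 1, 2, ... in turn and returns the first step length that passes the test, which is the usual backtracking reading of Eq. (<xref ref-type="disp-formula" rid="eqn-252">252</xref>). The function name, the default parameter values, and the cap on the number of backtracks are hypothetical.</p>
<p><code language="python"><![CDATA[
```python
import numpy as np


def armijo_step(cost, theta, g_est, d_est, alpha=0.1, beta=0.5, rho=1.0,
                max_backtracks=30):
    """Backtracking Armijo search on a fixed minibatch (illustrative sketch).

    cost(theta) is the minibatch cost estimate; g_est and d_est are the
    gradient and descent-direction estimates at theta. Returns the step
    length beta**j * rho, or None if no tested step length is accepted.
    """
    # g . d summed over all components, cf. Eq. (255); must be negative.
    slope = float(np.vdot(g_est, d_est))
    if slope >= 0.0:
        raise ValueError("d_est is not a descent direction: g . d >= 0")
    cost0 = cost(theta)
    for j in range(max_backtracks):
        eps = beta**j * rho
        # Sufficient-decrease test of Eq. (253) on the minibatch estimates.
        if cost(theta + eps * d_est) - cost0 <= alpha * eps * slope:
            return eps
    return None
```
]]></code></p>
<p>For example, for the quadratic cost estimate equal to 5 times the squared norm of the parameters, evaluated at the parameter value 1 with the steepest-descent choice of Eq. (<xref ref-type="disp-formula" rid="eqn-256">256</xref>), the step lengths 1, 0.5, and 0.25 overshoot the minimum and are rejected, and the sketch returns the step length 0.125.</p>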
<statement id="st6_16"><title><xref ref-type="statement" rid="st6_16">Remark 6.16</xref>.</title>
<p>In the SGD with Armijo line search and adaptive minibatch Algorithm <xref ref-type="fig" rid="fig-164">6</xref>, the reliability parameter <inline-formula id="ieqn-1357"><mml:math id="mml-ieqn-1357"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> and its use are another difference between Algorithm <xref ref-type="fig" rid="fig-164">6</xref> and Algorithm <xref ref-type="fig" rid="fig-160">2</xref>, the deterministic gradient descent with Armijo line search, and similarly for Algorithm <xref ref-type="fig" rid="fig-162">4</xref>, the standard SGD. The reason was provided in [<xref ref-type="bibr" rid="ref-144">144</xref>]: Even when the probabilities that the gradient estimate and the cost estimate are accurate are near 1, it is not guaranteed that the expected value of the cost at the next iterate <inline-formula id="ieqn-1358"><mml:math id="mml-ieqn-1358"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> would be below the cost at the current iterate <inline-formula id="ieqn-1359"><mml:math id="mml-ieqn-1359"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, since the cost may increase arbitrarily. &#x201C;Since random gradient may not be representative of the true gradient the function estimate accuracy and thus the expected improvement needs to be controlled by a different quantity,&#x201D; <inline-formula id="ieqn-1360"><mml:math id="mml-ieqn-1360"><mml:msubsup><mml:mi>&#x03B4;</mml:mi><mml:mi>k</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>The authors of [<xref ref-type="bibr" rid="ref-144">144</xref>] provided a rigorous convergence analysis of their proposed Algorithm <xref ref-type="fig" rid="fig-164">6</xref>, but had not implemented their method, and thus had no numerical results at the time of this writing.<xref ref-type="fn" rid="fn204"><sup>204</sup></xref><fn id="fn204"><label>204</label><p>Based on our private communications with the authors of [<xref ref-type="bibr" rid="ref-144">144</xref>] on 2019.11.16.</p></fn> Without empirical evidence that the algorithm works and is competitive compared to SGD (see adaptive methods and their criticism in Section <xref ref-type="sec" rid="s6_5_9">6.5.9</xref>), there would be no adoption.</p>
<table-wrap id="table-4"><label>Table 4</label>
<caption>
<p><italic>Armijo parameters</italic> (<xref ref-type="sec" rid="s6_2_3">Section 6.2.3</xref>, <xref ref-type="fig" rid="fig-160">Algorithms 2</xref>, <xref ref-type="fig" rid="fig-164">6</xref>, <xref ref-type="fig" rid="fig-165">7</xref>). Comparison of our notation, as used in Eq. (<xref ref-type="disp-formula" rid="eqn-125">125</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-126">126</xref>), with that of other authors: [<xref ref-type="bibr" rid="ref-139">139</xref>] [<xref ref-type="bibr" rid="ref-144">144</xref>] [<xref ref-type="bibr" rid="ref-145">145</xref>] [<xref ref-type="bibr" rid="ref-151">151</xref>]. In [<xref ref-type="bibr" rid="ref-151">151</xref>] [<xref ref-type="bibr" rid="ref-139">139</xref>] [<xref ref-type="bibr" rid="ref-144">144</xref>], the parameters are for the first-order term <inline-formula id="ieqn-394"><mml:math id="mml-ieqn-394"><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>g</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00B7;</mml:mo><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mi>d</mml:mi> <mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> (for steepest descent, <inline-formula id="ieqn-395"><mml:math id="mml-ieqn-395"><mml:mrow><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi><mml:mo>/</mml:mo><mml:mo>&#x2202;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula>) in the Taylor series expansion of <inline-formula id="ieqn-396"><mml:math id="mml-ieqn-396"><mml:mrow><mml:mi>&#x0394;</mml:mi><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mi>J</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>J</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>. In [<xref ref-type="bibr" rid="ref-145">145</xref>], the parameters are for the second-order term in the Taylor series expansion, leading to the cube of the norm of the descent direction, <inline-formula id="ieqn-397"><mml:math id="mml-ieqn-397"><mml:msup><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mi>d</mml:mi> <mml:mo>&#x2016;</mml:mo></mml:mrow><mml:mn>3</mml:mn></mml:msup></mml:math></inline-formula>. For the stochastic optimization algorithms presented in this paper, the 4th parameter <inline-formula id="ieqn-398"><mml:math id="mml-ieqn-398"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> is introduced to represent the reliability parameter <inline-formula id="ieqn-399"><mml:math id="mml-ieqn-399"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> of [<xref ref-type="bibr" rid="ref-144">144</xref>] (Algorithm <xref ref-type="fig" rid="fig-164">6</xref>) and the stability parameter <inline-formula id="ieqn-400"><mml:math id="mml-ieqn-400"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> of [<xref ref-type="bibr" rid="ref-145">145</xref>] (Algorithm <xref ref-type="fig" rid="fig-165">7</xref>). The deterministic algorithms in [<xref ref-type="bibr" rid="ref-151">151</xref>] and [<xref ref-type="bibr" rid="ref-139">139</xref>] do not have this 4th parameter.</p></caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th align="center">Parameter type</th>
<th align="center">This paper</th>
<th align="center">Polak [<xref ref-type="bibr" rid="ref-139">139</xref>]</th>
<th align="center">Paquette [<xref ref-type="bibr" rid="ref-144">144</xref>]</th>
<th align="center">Bergou [<xref ref-type="bibr" rid="ref-145">145</xref>]</th>
<th align="center">Armijo [<xref ref-type="bibr" rid="ref-151">151</xref>]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">Fixed factor</td>
<td align="center"><inline-formula id="ieqn-401"><mml:math id="mml-ieqn-401"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-402"><mml:math id="mml-ieqn-402"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-403"><mml:math id="mml-ieqn-403"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-404"><mml:math id="mml-ieqn-404"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>6</mml:mn></mml:mfrac></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-405"><mml:math id="mml-ieqn-405"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula></td>
</tr>
<tr>
<td align="center">Varying-power factor</td>
<td align="center"><inline-formula id="ieqn-406"><mml:math id="mml-ieqn-406"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-407"><mml:math id="mml-ieqn-407"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-408"><mml:math id="mml-ieqn-408"><mml:msup><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-409"><mml:math id="mml-ieqn-409"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-410"><mml:math id="mml-ieqn-410"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula></td>
</tr>
<tr>
<td align="center">Fixed factor</td>
<td align="center"><inline-formula id="ieqn-411"><mml:math id="mml-ieqn-411"><mml:mi>&#x03C1;</mml:mi></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-412"><mml:math id="mml-ieqn-412"><mml:mi>&#x03C1;</mml:mi></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-413"><mml:math id="mml-ieqn-413"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-414"><mml:math id="mml-ieqn-414"><mml:mi>&#x03B7;</mml:mi></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-415"><mml:math id="mml-ieqn-415"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula></td>
</tr>
<tr>
<td align="center">Reliability / Stability</td>
<td align="center"><inline-formula id="ieqn-416"><mml:math id="mml-ieqn-416"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula></td>
<td align="center">&#x2014;</td>
<td align="center"><inline-formula id="ieqn-417"><mml:math id="mml-ieqn-417"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula></td>
<td align="center"><inline-formula id="ieqn-418"><mml:math id="mml-ieqn-418"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula></td>
<td align="center">&#x2014;</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="fig-78">
<label>Figure 78</label>
<caption><title><italic>Stochastic Newton with Armijo-like 2nd order line search</italic> (Section <xref ref-type="sec" rid="s6_7">6.7</xref>). IJCNN1 dataset from the LIBSVM library. Three batch sizes were used (1%, 5%, 100%) for both SGD and ALAS (stochastic Newton Algorithm <xref ref-type="fig" rid="fig-165">7</xref>). The exact gradient norm for each of these six cases was plotted against the training epochs on the left, and against the iteration numbers on the right. An epoch is the number of non-overlapping minibatches (and thus iterations) to cover the whole training set. One epoch for a minibatch size of s% (respectively 1%, 5%, 100%) of the training set is equivalent to 100/<italic>s</italic> (respectively 100, 20, 1) iterations. Thus, for SGD-ALGO (1%), as shown, 10 epochs is equivalent to 1,000 iterations, with the same gradient norm. The markers on the curves were placed every 100 epochs (left) and 800 iterations (right). For the same number of epochs, say 10, SGD with smaller minibatches yielded lower gradient norm. The same was true for Algorithm <xref ref-type="fig" rid="fig-165">7</xref> for number of epochs less than 10, but the gradient norm plateaued out after that with a lot of noise. Second-order Algorithm <xref ref-type="fig" rid="fig-165">7</xref> converged faster than 1st-order SGD. See [<xref ref-type="bibr" rid="ref-145">145</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-78.tif"/>
</fig>
</sec>
<sec id="s6_7"><label>6.7</label>
<title>Stochastic Newton method with 2nd-order line search</title>
<p>The stochastic Newton method in Algorithm <xref ref-type="fig" rid="fig-165">7</xref>, described in [<xref ref-type="bibr" rid="ref-145">145</xref>], generalizes the deterministic Newton method in Algorithm <xref ref-type="fig" rid="fig-161">3</xref> by adding stochasticity via random selection of minibatches and an Armijo-like 2nd-order line search.<xref ref-type="fn" rid="fn205"><sup>205</sup></xref><fn id="fn205"><label>205</label><p>The Armijo line search itself is 1st order; see Section <xref ref-type="sec" rid="s6_2">6.2</xref> on full-batch deterministic optimization.</p></fn></p>
<p>Once a minibatch has been randomly selected as in Eq. (<xref ref-type="disp-formula" rid="eqn-137">137</xref>), the estimates of the cost function <inline-formula id="ieqn-1361"><mml:math id="mml-ieqn-1361"><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-139">139</xref>), the gradient <inline-formula id="ieqn-1362"><mml:math id="mml-ieqn-1362"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-140">140</xref>), and the Hessian <inline-formula id="ieqn-1363"><mml:math id="mml-ieqn-1363"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> (a new quantity) are computed in the same way as for the standard SGD in Algorithm <xref ref-type="fig" rid="fig-162">4</xref>; the relevant equations are recalled below for convenience:</p>
<p><disp-formula id="eqn-137a"><mml:math id="mml-eqn-137a" display="block"><mml:mrow><mml:msup><mml:mi mathvariant='double-struck'>B</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mtext>m</mml:mtext><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo>&#x007B;</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mtext>m</mml:mtext></mml:msub><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>&#x007D;</mml:mo><mml:mo>&#x2286;</mml:mo><mml:mi mathvariant='double-struck'>X</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x007B;</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mtext>M</mml:mtext><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>&#x007D;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-139a"><mml:math id="mml-eqn-139a" display="block"><mml:mrow><mml:mover accent='true'><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mtext>m</mml:mtext></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mtext>m</mml:mtext><mml:mo>&#x2264;</mml:mo><mml:mtext>M</mml:mtext></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;with&#x2009;</mml:mtext><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo 
stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;and&#x2009;</mml:mtext><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi mathvariant='double-struck'>B</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mtext>m</mml:mtext><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi mathvariant='double-struck'>T</mml:mi><mml:mrow><mml:mo>&#x007C;</mml:mo><mml:mtext>m</mml:mtext><mml:mo>&#x007C;</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-140a"><mml:math id="mml-eqn-140a" display="block"><mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi>g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mover accent='true'><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mstyle></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mtext>m</mml:mtext></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mtext>m</mml:mtext><mml:mo>&#x2264;</mml:mo><mml:mtext>M</mml:mtext></mml:mrow></mml:munderover><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mover accent='true'><mml:mi>J</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi 
mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>
<disp-formula id="eqn-258"><label>(258)</label><mml:math id="mml-eqn-258" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D5C6;</mml:mtext></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mtext>&#x1D5AC;</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:msup><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:msub><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>In the above computation, the minibatch in Algorithm <xref ref-type="fig" rid="fig-165">7</xref> is fixed, not adaptive such as in Algorithm <xref ref-type="fig" rid="fig-164">6</xref>.</p>
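<p>The minibatch estimates in Eqs. (<xref ref-type="disp-formula" rid="eqn-139">139</xref>), (<xref ref-type="disp-formula" rid="eqn-140">140</xref>), and (<xref ref-type="disp-formula" rid="eqn-258">258</xref>) can be sketched in a few lines of Python. The sketch assumes, purely for illustration, a linear model with a least-squares per-example cost, which is not the model used in [<xref ref-type="bibr" rid="ref-145">145</xref>]; the function name <monospace>minibatch_estimates</monospace> and the toy data are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random


def minibatch_estimates(theta, xs, ys, batch_idx):
    """Average cost, gradient, and Hessian over the examples in batch_idx.

    Illustrative assumption: J_i(theta) = 0.5 * (theta . x_i - y_i)**2,
    so that dJ_i/dtheta = r_i * x_i and d2J_i/dtheta2 = x_i x_i^T.
    """
    n = len(theta)
    m = len(batch_idx)  # minibatch size
    J = 0.0
    g = [0.0] * n
    H = [[0.0] * n for _ in range(n)]
    for i in batch_idx:
        x, y = xs[i], ys[i]
        r = sum(t * xj for t, xj in zip(theta, x)) - y  # residual f(x, theta) - y
        J += 0.5 * r * r / m
        for a in range(n):
            g[a] += r * x[a] / m
            for b in range(n):
                H[a][b] += x[a] * x[b] / m
    return J, g, H


# Toy training set with M = 4 examples and 2 parameters (hypothetical data).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [1.0, 2.0, 3.0, 4.0]
theta = [0.0, 0.0]

# Random minibatch of fixed size m = 2, drawn without replacement.
rng = random.Random(0)
batch_idx = rng.sample(range(len(xs)), 2)
J_tilde, g_tilde, H_tilde = minibatch_estimates(theta, xs, ys, batch_idx)

# Full-batch values (m = M) for comparison.
J_full, g_full, H_full = minibatch_estimates(theta, xs, ys, range(len(xs)))
```
]]></preformat>
<p>With the full batch, the estimates coincide with the exact cost, gradient, and Hessian of this toy problem; with a smaller minibatch, they are noisy approximations whose quality depends on which examples were drawn.</p>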
<p>If the current iterate <inline-formula id="ieqn-1364"><mml:math id="mml-ieqn-1364"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is in a flat region (or plateau) or at the bottom of a local convex bowl, then the smallest eigenvalue <inline-formula id="ieqn-1365"><mml:math id="mml-ieqn-1365"><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> of the Hessian estimate <inline-formula id="ieqn-1366"><mml:math id="mml-ieqn-1366"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> would be close to zero or positive, respectively, and the gradient estimate would be zero (line <xref ref-type="fig" rid="fig-165">7</xref> in Algorithm <xref ref-type="fig" rid="fig-165">7</xref>):</p>
<p><disp-formula id="eqn-259"><label>(259)</label><mml:math id="mml-eqn-259" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2265;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2225;=</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1367"><mml:math id="mml-ieqn-1367"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> is a small positive number. In this case, no step (or update) is taken, which is equivalent to setting the step length or the descent direction to zero<xref ref-type="fn" rid="fn206"><sup>206</sup></xref><fn id="fn206"><label>206</label><p>See Step 2 of Algorithm 1 in [<xref ref-type="bibr" rid="ref-145">145</xref>].</p></fn>, and the algorithm proceeds to the next iteration <inline-formula id="ieqn-1368"><mml:math id="mml-ieqn-1368"><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. Otherwise, i.e., when the conditions in Eq. (<xref ref-type="disp-formula" rid="eqn-259">259</xref>) are not met, the remaining steps are carried out.</p>
<p>If the current iterate <inline-formula id="ieqn-1369"><mml:math id="mml-ieqn-1369"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is on the downward side of a saddle point, characterized by the condition that the smallest eigenvalue <inline-formula id="ieqn-1370"><mml:math id="mml-ieqn-1370"><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is clearly negative (line <xref ref-type="fig" rid="fig-165">10</xref> in Algorithm <xref ref-type="fig" rid="fig-165">7</xref>):</p>
<p><disp-formula id="eqn-260"><label>(260)</label><mml:math id="mml-eqn-260" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x003C;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>then find the eigenvector <inline-formula id="ieqn-1371"><mml:math id="mml-ieqn-1371"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> corresponding to <inline-formula id="ieqn-1372"><mml:math id="mml-ieqn-1372"><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, scale it such that its norm is equal to the absolute value of this negative smallest eigenvalue, choose its sign such that it forms an obtuse angle with the gradient <inline-formula id="ieqn-1373"><mml:math id="mml-ieqn-1373"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, and select this eigenvector as the descent direction <inline-formula id="ieqn-1374"><mml:math id="mml-ieqn-1374"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-261"><label>(261)</label><mml:math id="mml-eqn-261" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;such that&#x00A0;</mml:mtext></mml:mrow><mml:mo>&#x2225;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2225;=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn><mml:mo 
stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
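<p>The negative-curvature direction in Eqs. (<xref ref-type="disp-formula" rid="eqn-260">260</xref>) and (<xref ref-type="disp-formula" rid="eqn-261">261</xref>) can be illustrated with the following Python sketch. To stay self-contained, it is restricted to a symmetric 2-by-2 Hessian estimate, for which the smallest eigenpair has a closed form; this restriction, the function names, and the numerical values are assumptions made for illustration and are not taken from [<xref ref-type="bibr" rid="ref-145">145</xref>].</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def smallest_eigenpair_2x2(H):
    """Smallest eigenvalue and a unit eigenvector of a symmetric 2x2 matrix."""
    a, b, c = H[0][0], H[0][1], H[1][1]
    lam = 0.5 * (a + c) - math.hypot(0.5 * (a - c), b)
    # Two algebraically valid eigenvector candidates; keep the longer one
    # to avoid dividing by a (near-)zero norm.
    v1 = (b, lam - a)
    v2 = (lam - c, b)
    v = v1 if math.hypot(*v1) >= math.hypot(*v2) else v2
    norm = math.hypot(*v)
    if norm == 0.0:  # H is a multiple of the identity: any direction works
        return lam, (1.0, 0.0)
    return lam, (v[0] / norm, v[1] / norm)


def negative_curvature_direction(H, g, delta):
    """Direction of Eq. (261), or None if the test of Eq. (260) fails."""
    lam, v = smallest_eigenpair_2x2(H)
    if lam >= -math.sqrt(delta):
        return None  # smallest eigenvalue is not clearly negative
    d = (-lam * v[0], -lam * v[1])  # norm of d equals |lam|
    # Flip the sign if needed so that d makes an obtuse angle with g.
    # (If d is exactly orthogonal to g, no sign choice gives d . g < 0.)
    if d[0] * g[0] + d[1] * g[1] > 0.0:
        d = (-d[0], -d[1])
    return d


# Saddle-like Hessian estimate with eigenvalues 1 and -2 (hypothetical values).
H_saddle = [[1.0, 0.0], [0.0, -2.0]]
g = (0.5, 1.0)
d = negative_curvature_direction(H_saddle, g, delta=0.01)

# Positive-definite Hessian estimate: Eq. (260) is not satisfied.
d_none = negative_curvature_direction([[2.0, 0.0], [0.0, 1.0]], g, delta=0.01)
```
]]></preformat>
<p>For the saddle-like example, the returned direction lies along the eigenvector of the eigenvalue -2, has norm 2, and points against the gradient; for the positive-definite example, the function returns <monospace>None</monospace>, signalling that one of the other cases applies.</p>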
<p>When the iterate <inline-formula id="ieqn-1375"><mml:math id="mml-ieqn-1375"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is in a local convex bowl, the Hessian <inline-formula id="ieqn-1376"><mml:math id="mml-ieqn-1376"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is positive definite, i.e., its smallest eigenvalue <inline-formula id="ieqn-1377"><mml:math id="mml-ieqn-1377"><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is strictly positive; in this case, use the Newton direction as the descent direction <inline-formula id="ieqn-1378"><mml:math id="mml-ieqn-1378"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> (line <xref ref-type="fig" rid="fig-165">13</xref> in Algorithm <xref ref-type="fig" rid="fig-165">7</xref>):</p>
<p><disp-formula id="eqn-262"><label>(262)</label><mml:math id="mml-eqn-262" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x003E;&#x2225;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The remaining case is when the iterate <inline-formula id="ieqn-1379"><mml:math id="mml-ieqn-1379"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is close to a saddle point such that the smallest eigenvalue <inline-formula id="ieqn-1380"><mml:math id="mml-ieqn-1380"><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is bounded below by <inline-formula id="ieqn-1381"><mml:math id="mml-ieqn-1381"><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and above by <inline-formula id="ieqn-1382"><mml:math id="mml-ieqn-1382"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, then <inline-formula id="ieqn-1383"><mml:math id="mml-ieqn-1383"><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is nearly zero, and thus the Hessian estimate <inline-formula id="ieqn-1384"><mml:math id="mml-ieqn-1384"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is nearly singular. 
Regularize (or stabilize) the Hessian estimate, i.e., move its smallest eigenvalue away from zero, by adding a small diagonal perturbation matrix using the sum of the bounds <inline-formula id="ieqn-1385"><mml:math id="mml-ieqn-1385"><mml:msup><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1386"><mml:math id="mml-ieqn-1386"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. The regularized (or stabilized) Hessian estimate is no longer nearly singular, and is thus invertible, so that it can be used to find the Newton descent direction (lines <xref ref-type="fig" rid="fig-165">16</xref>-<xref ref-type="fig" rid="fig-165">17</xref> in Algorithm <xref ref-type="fig" rid="fig-165">7</xref> for stochastic Newton and line <xref ref-type="fig" rid="fig-161">15</xref> in Algorithm <xref ref-type="fig" rid="fig-161">3</xref> for deterministic Newton):</p>
<p><disp-formula id="eqn-263"><label>(263)</label><mml:math id="mml-eqn-263" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2264;&#x2225;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x003C;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mo>&#x2225;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2225;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
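<p>The choice between the plain Newton direction in Eq. (<xref ref-type="disp-formula" rid="eqn-262">262</xref>) and the regularized Newton direction in Eq. (<xref ref-type="disp-formula" rid="eqn-263">263</xref>) can be sketched in Python as follows. The sketch is limited to a 2-by-2 Hessian estimate solved by Cramer's rule and takes the smallest eigenvalue as an input assumed to have been computed elsewhere; these simplifications, the function names, and the numerical values are illustrative assumptions, not the implementation of [<xref ref-type="bibr" rid="ref-145">145</xref>].</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def solve_2x2(A, rhs):
    """Solve A d = rhs for a 2x2 matrix A by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    d0 = (rhs[0] * A[1][1] - A[0][1] * rhs[1]) / det
    d1 = (A[0][0] * rhs[1] - A[1][0] * rhs[0]) / det
    return (d0, d1)


def newton_like_direction(H, g, lam_min, delta):
    """Return (direction, shift) for the cases of Eqs. (259), (262), (263)."""
    sqrt_delta = math.sqrt(delta)
    if lam_min < -sqrt_delta:
        # Negative-curvature case, Eqs. (260)-(261): not handled in this sketch.
        raise ValueError("smallest eigenvalue is clearly negative")
    g_norm = math.hypot(g[0], g[1])
    if g_norm == 0.0:
        return (0.0, 0.0), 0.0  # Eq. (259): no step is taken
    g_norm_sqrt = math.sqrt(g_norm)  # ||g||^(1/2)
    if lam_min > g_norm_sqrt:
        shift = 0.0  # Eq. (262): plain Newton direction
    else:
        shift = g_norm_sqrt + sqrt_delta  # Eq. (263): regularized Hessian
    A = [[H[0][0] + shift, H[0][1]],
         [H[1][0], H[1][1] + shift]]
    return solve_2x2(A, (-g[0], -g[1])), shift


g = (1.0, 0.0)  # hypothetical gradient estimate, ||g||^(1/2) = 1

# Well-conditioned bowl: smallest eigenvalue 2 > 1, so Eq. (262) applies.
d_newton, shift_newton = newton_like_direction(
    [[4.0, 0.0], [0.0, 2.0]], g, lam_min=2.0, delta=0.01)

# Nearly singular Hessian: smallest eigenvalue 0.01 <= 1, so Eq. (263) applies.
d_reg, shift_reg = newton_like_direction(
    [[1.0, 0.0], [0.0, 0.01]], g, lam_min=0.01, delta=0.01)
```
]]></preformat>
<p>In the second example, the shift moves the smallest eigenvalue of the matrix being inverted from 0.01 to about 1.11, at the price of a shorter step than the unregularized Newton step would give along the gradient direction.</p>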
<p>If the stopping criterion is not met, use Armijo&#x2019;s rule to find the step length <inline-formula id="ieqn-1387"><mml:math id="mml-ieqn-1387"><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to update the parameter <inline-formula id="ieqn-1388"><mml:math id="mml-ieqn-1388"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to <inline-formula id="ieqn-1389"><mml:math id="mml-ieqn-1389"><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, then go to the next iteration <inline-formula id="ieqn-1390"><mml:math id="mml-ieqn-1390"><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (line <xref ref-type="fig" rid="fig-165">19</xref> in Algorithm <xref ref-type="fig" rid="fig-165">7</xref>). The authors of [<xref ref-type="bibr" rid="ref-145">145</xref>] provided a detailed discussion of their stopping criterion. 
Otherwise, i.e., when the stopping criterion is met, accept the current iterate <inline-formula id="ieqn-1391"><mml:math id="mml-ieqn-1391"><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as the local-minimizer estimate <inline-formula id="ieqn-1392"><mml:math id="mml-ieqn-1392"><mml:msup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula>, and stop the Newton-descent &#x2461; <bold>for</bold> loop to end the current training epoch <inline-formula id="ieqn-1393"><mml:math id="mml-ieqn-1393"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula>.</p>
<p>From Eq. (<xref ref-type="disp-formula" rid="eqn-125">125</xref>), the deterministic 1st-order Armijo&#x2019;s rule for steepest descent can be written as:</p>
<p><disp-formula id="eqn-264"><label>(264)</label><mml:math id="mml-eqn-264" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>a</mml:mi></mml:msup><mml:mi>&#x03C1;</mml:mi><mml:mo>&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">d</mml:mi><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext></mml:mrow><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-1394"><mml:math id="mml-ieqn-1394"><mml:mi>a</mml:mi></mml:math></inline-formula> being the minimum power for Eq. (<xref ref-type="disp-formula" rid="eqn-264">264</xref>) to be satisfied. In Algorithm <xref ref-type="fig" rid="fig-165">7</xref>, the Armijo-like 2nd-order line search reads as follows:</p>
<p><disp-formula id="eqn-265"><label>(265)</label><mml:math id="mml-eqn-265" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>6</mml:mn></mml:mfrac><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mi>a</mml:mi></mml:msup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>3</mml:mn></mml:msup><mml:mi>&#x03C1;</mml:mi><mml:mo>&#x2225;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>3</mml:mn></mml:msup><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-1395"><mml:math id="mml-ieqn-1395"><mml:mi>a</mml:mi></mml:math></inline-formula> being the minimum power for Eq. (<xref ref-type="disp-formula" rid="eqn-265">265</xref>) to be satisfied. The parallelism between Eq. (<xref ref-type="disp-formula" rid="eqn-265">265</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-264">264</xref>) is clear; see also <xref ref-type="table" rid="table-4">Table 4</xref>.</p>
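<p>The search for the minimum power in the 1st-order rule of Eq. (<xref ref-type="disp-formula" rid="eqn-264">264</xref>) can be sketched as a short backtracking loop. The sketch below is only an illustration under stated assumptions: the step length is taken to be the product of the power of the reduction factor and the initial step length, the function name <monospace>armijo_power</monospace> and its default parameter values are hypothetical, and the cost function is a toy quadratic rather than a network loss.</p>
<code language="python">
```python
import numpy as np


def armijo_power(J, theta, g, alpha=0.1, beta=0.5, rho=1.0, max_power=50):
    """Return (a, eps): the smallest power a passing the test in Eq. (264)."""
    d = -g                      # steepest-descent direction, d = -g
    d_sq = float(d @ d)         # squared norm of d
    J0 = J(theta)
    for a in range(max_power + 1):
        eps = beta**a * rho     # ASSUMPTION: step length eps_k = beta^a * rho
        # Sufficient decrease: J(theta + eps d) must not exceed
        # J(theta) - alpha * beta^a * rho * |d|^2
        if J0 - alpha * eps * d_sq >= J(theta + eps * d):
            return a, eps
    raise RuntimeError("no admissible power found within max_power trials")


# Toy cost J(theta) = 0.5 |theta|^2, whose gradient is theta itself.
J = lambda th: 0.5 * float(th @ th)
theta0 = np.array([1.0, 1.0])
a, eps = armijo_power(J, theta0, g=theta0, rho=4.0)
print(a, eps)
```
</code>
<p>With the deliberately large initial step length used here, the first two trial steps fail the decrease test, and the power 2 is accepted.</p>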
<p>Figure <xref ref-type="fig" rid="fig-78">78</xref> shows the numerical results of Algorithm <xref ref-type="fig" rid="fig-165">7</xref> on the IJCNN1 dataset ([<xref ref-type="bibr" rid="ref-211">211</xref>]) from the Library for Support Vector Machines (LIBSVM) by [<xref ref-type="bibr" rid="ref-212">212</xref>]. Plots versus epochs are seldom shown side by side with plots versus iterations. Some papers may have only plots versus iterations (e.g., [<xref ref-type="bibr" rid="ref-182">182</xref>]); other papers may rely only on plots versus epochs to draw conclusions (e.g., [<xref ref-type="bibr" rid="ref-56">56</xref>]). Figure <xref ref-type="fig" rid="fig-78">78</xref> thus provides a good example of the differences, as noted in <xref ref-type="statement" rid="st6_17">Remark 6.17</xref>.</p>
<statement id="st6_17"><title><xref ref-type="statement" rid="st6_17">Remark 6.17</xref>.</title>
<p><italic>Epoch counter vs global iteration counter in plots</italic>. When the gradient norm was plotted versus epochs (left of Figure <xref ref-type="fig" rid="fig-78">78</xref>), the three curves for <xref ref-type="sec" rid="s6_3_1">SGD</xref> were separated, with faster convergence for smaller minibatch sizes, Eq. (<xref ref-type="disp-formula" rid="eqn-135">135</xref>), but the corresponding three curves fell on top of each other when plotted versus iterations (right of Figure <xref ref-type="fig" rid="fig-78">78</xref>). The reason was that the scale on the horizontal axis differed for each curve, e.g., 1 iteration for full batch was equivalent to 100 iterations for a minibatch size at 1% of the full batch. The plot versus iterations was thus a zoomed-in view, but of each curve separately. To compare the rates of convergence among different algorithms and different minibatch sizes, look at the plots versus epochs, since each epoch covers the whole training set. It is just an optical illusion to think that <xref ref-type="sec" rid="s6_3_1">SGD</xref> with different minibatch sizes had the same rate of convergence.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
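<p>The change of horizontal scale described in <xref ref-type="statement" rid="st6_17">Remark 6.17</xref> amounts to a simple conversion between the global iteration counter and the epoch counter. The following sketch illustrates it; the training-set size and the function names are hypothetical and are not taken from the experiments in Figure <xref ref-type="fig" rid="fig-78">78</xref>.</p>
<code language="python">
```python
import math


def iterations_per_epoch(n_examples, minibatch_size):
    """Number of parameter updates needed to sweep the training set once."""
    return math.ceil(n_examples / minibatch_size)


def iterations_to_epochs(iteration, n_examples, minibatch_size):
    """Convert a global iteration counter into (fractional) epochs."""
    return iteration / iterations_per_epoch(n_examples, minibatch_size)


n_examples = 10_000  # hypothetical training-set size, not taken from the source
for fraction in (1.0, 0.1, 0.01):
    m = round(fraction * n_examples)  # minibatch size as a fraction of full batch
    print(fraction,
          iterations_per_epoch(n_examples, m),
          iterations_to_epochs(100, n_examples, m))
```
</code>
<p>For the same 100 iterations, the full-batch run has swept the training set 100 times, whereas the run with a minibatch size at 1% of the full batch has swept it only once, which is why curves that coincide versus iterations separate versus epochs.</p>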
<p>The authors of [<xref ref-type="bibr" rid="ref-145">145</xref>] planned to test their Algorithm <xref ref-type="fig" rid="fig-165">7</xref> on large datasets such as CIFAR-10 and to report the results in 2020.<xref ref-type="fn" rid="fn207"><sup>207</sup></xref><fn id="fn207"><label>207</label><p>Per our private correspondence as of 2019.12.18.</p></fn></p>
<p>Another algorithm along the same line as Algorithm <xref ref-type="fig" rid="fig-164">6</xref> and Algorithm <xref ref-type="fig" rid="fig-165">7</xref> is the stochastic quasi-Newton method proposed in [<xref ref-type="bibr" rid="ref-146">146</xref>], where the stochastic Wolfe line search of [<xref ref-type="bibr" rid="ref-143">143</xref>] was employed, but without numerical experiments on large datasets such as CIFAR-10.</p>
<p>At the time of this writing, numerical results for these methods on large datasets commonly used for testing in the deep-learning community, such as CIFAR-10, CIFAR-100 and the like, are lacking, and so is a comparison of performance in terms of cost and accuracy against Adam and its variants. Our assessment is therefore that <xref ref-type="sec" rid="s6_3_1">SGD</xref> and its variants, or <xref ref-type="sec" rid="s6_5_6">Adam</xref> and its better variants, particularly <xref ref-type="sec" rid="s6_5_10">AdamW</xref>, continue to be the prevalent methods for training.</p>
<fig id="fig-165">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-165.tif"/>
</fig>
<p>Time constraints did not allow us to review other stochastic optimization methods, such as those with the gradient-only line search in [<xref ref-type="bibr" rid="ref-142">142</xref>] and [<xref ref-type="bibr" rid="ref-179">179</xref>].</p></sec></sec>
<sec id="s7"><label>7</label>
<title>Dynamics, sequential data, sequence modeling</title>
<sec id="s7_1"><label>7.1</label>
<title>Recurrent Neural Networks (RNNs)</title>
<p>In many fields of physics, the respective governing equations that describe the response of a system to (external) stimuli follow a common pattern. The temporal and/or spatial change of some quantity of a system is balanced by sources that cause the change, which is why we refer to equations of this kind as balance relations. The balance of linear momentum in mechanics, for instance, establishes a relationship between the temporal change of a body&#x2019;s linear momentum and the forces acting on the body. Along with kinematic relations and constitutive laws, the balance equations provide the foundation to derive the equations of motion of some mechanical system. For linear problems, the equations of motion constitute a system of second-order ODEs in appropriate (generalized) coordinates <inline-formula id="ieqn-1400"><mml:math id="mml-ieqn-1400"><mml:mi>d</mml:mi></mml:math></inline-formula>, which, in case of continua, is possibly obtained by some spatial discretization,</p>
<p><disp-formula id="eqn-266"><label>(266)</label><mml:math id="mml-eqn-266" display="block"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mi>M</mml:mi><mml:mover><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x2022;</mml:mo><mml:mo>&#x2022;</mml:mo></mml:mrow></mml:mover><mml:mo>+</mml:mo><mml:mi>D</mml:mi><mml:mover><mml:mi>d</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover><mml:mo>+</mml:mo><mml:mi>C</mml:mi><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula></p>
<p>In the above equation, <inline-formula id="ieqn-1401"><mml:math id="mml-ieqn-1401"><mml:mi>M</mml:mi></mml:math></inline-formula> denotes the mass matrix, <inline-formula id="ieqn-1402"><mml:math id="mml-ieqn-1402"><mml:mi>D</mml:mi></mml:math></inline-formula> is a damping matrix and <inline-formula id="ieqn-1403"><mml:math id="mml-ieqn-1403"><mml:mi>C</mml:mi></mml:math></inline-formula> is the stiffness matrix; <inline-formula id="ieqn-1404"><mml:math id="mml-ieqn-1404"><mml:mi>f</mml:mi></mml:math></inline-formula> is the vector of (generalized) forces. The equations of motion can be rewritten as a system of first-order ODEs by introducing the vector of (generalized) velocities <inline-formula id="ieqn-1405"><mml:math id="mml-ieqn-1405"><mml:mi>v</mml:mi></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-267"><label>(267)</label><mml:math id="mml-eqn-267" display="block"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi>d</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi>v</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:munder><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mi>I</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mi>C</mml:mi></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mi>D</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy='true'>&#xFE38;</mml:mo></mml:munder><mml:mi>A</mml:mi></mml:munder><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mi>d</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>v</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:munder><mml:munder><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy='true'>&#xFE38;</mml:mo></mml:munder><mml:mi>B</mml:mi></mml:munder><mml:mi>f</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>In control theory, <inline-formula id="ieqn-1406"><mml:math id="mml-ieqn-1406"><mml:mi>A</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-1407"><mml:math id="mml-ieqn-1407"><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula> are referred to as state matrix and input vector, respectively.<xref ref-type="fn" rid="fn208"><sup>208</sup></xref><fn id="fn208"><label>208</label><p>The state-space representation of time-continuous LTI-systems in control theory, see, e.g., Chapter 3 of [<xref ref-type="bibr" rid="ref-213">213</xref>], &#x201C;State Variables and the State Space Description of Dynamic Systems&#x201D; is typically written as <inline-formula id="ieqn-3260"><mml:math id="mml-ieqn-3260"><mml:mrow><mml:mover><mml:mi>x</mml:mi><mml:mo>&#x00B7;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>+</mml:mo><mml:mi>B</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi></mml:math></inline-formula>, with the output equation <inline-formula id="ieqn-3261"><mml:math id="mml-ieqn-3261"><mml:mi mathvariant='bold-italic'>y</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">C</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">D</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi></mml:math></inline-formula>. The state vector is denoted by <inline-formula id="ieqn-3262"><mml:math id="mml-ieqn-3262"><mml:mi mathvariant='bold-italic'>x</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-3263"><mml:math id="mml-ieqn-3263"><mml:mi mathvariant='bold-italic'>u</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-3264"><mml:math id="mml-ieqn-3264"><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:math></inline-formula> are the vectors of inputs and outputs, respectively. 
The ODE describes the (temporal) evolution of the system&#x2019;s state. The output of the system <inline-formula id="ieqn-3265"><mml:math id="mml-ieqn-3265"><mml:mi mathvariant='bold-italic'>y</mml:mi></mml:math></inline-formula> is a linear combination of states <inline-formula id="ieqn-3266"><mml:math id="mml-ieqn-3266"><mml:mi mathvariant='bold-italic'>x</mml:mi></mml:math></inline-formula> and the inputs <inline-formula id="ieqn-3267"><mml:math id="mml-ieqn-3267"><mml:mi mathvariant='bold-italic'>u</mml:mi></mml:math></inline-formula>.</p></fn> If we gather the (generalized) coordinates and velocities in the state vector <inline-formula id="ieqn-1408"><mml:math id="mml-ieqn-1408"><mml:mi mathvariant="bold-italic">q</mml:mi></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-268"><label>(268)</label><mml:math id="mml-eqn-268" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:mtd><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>we obtain a compact representation of the equations of motion, which relates the temporal change of the system&#x2019;s state <inline-formula id="ieqn-1409"><mml:math id="mml-ieqn-1409"><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> to its current state <inline-formula id="ieqn-1410"><mml:math id="mml-ieqn-1410"><mml:mi mathvariant="bold-italic">q</mml:mi></mml:math></inline-formula> and the input <inline-formula id="ieqn-1411"><mml:math id="mml-ieqn-1411"><mml:mi>b</mml:mi></mml:math></inline-formula> linearly by means of</p>
<p><disp-formula id="eqn-269"><label>(269)</label><mml:math id="mml-eqn-269" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mover><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>+</mml:mo><mml:mi>B</mml:mi><mml:mi>f</mml:mi><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
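<p>The assembly of the state matrix and the input matrix in Eq. (<xref ref-type="disp-formula" rid="eqn-267">267</xref>) and the evaluation of the rate in Eq. (<xref ref-type="disp-formula" rid="eqn-269">269</xref>) can be illustrated numerically. The sketch below assumes a hypothetical single-degree-of-freedom oscillator with arbitrarily chosen mass, damping and stiffness values; the function name <monospace>state_space</monospace> is likewise only illustrative.</p>
<code language="python">
```python
import numpy as np


def state_space(M, D, C):
    """Assemble A and B of Eq. (267) from mass, damping and stiffness matrices."""
    n = M.shape[0]
    Minv_C = np.linalg.solve(M, C)   # M^{-1} C without forming the inverse
    Minv_D = np.linalg.solve(M, D)   # M^{-1} D
    A = np.block([[np.zeros((n, n)), np.eye(n)],
                  [-Minv_C, -Minv_D]])
    B = np.vstack([np.zeros((n, n)), np.linalg.inv(M)])
    return A, B


# Hypothetical single-degree-of-freedom oscillator (values chosen for illustration).
M = np.array([[2.0]])
D = np.array([[0.4]])
C = np.array([[8.0]])
A, B = state_space(M, D, C)

q = np.array([0.1, 0.0])      # state q = [d, v], Eq. (268)
f = np.array([1.0])           # generalized force
q_dot = A @ q + B @ f         # right-hand side of Eq. (269)
print(A)
print(B)
print(q_dot)
```
</code>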
<p>Similar relations are found in neuroscience,<xref ref-type="fn" rid="fn209"><sup>209</sup></xref><fn id="fn209"><label>209</label><p>See Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref> on &#x201C;Dynamic, time dependence, Volterra series&#x201D;.</p></fn> where the dynamics of neurons was accounted for, e.g., in the pioneering work [<xref ref-type="bibr" rid="ref-214">214</xref>], whose author modeled a neuron as an electrical circuit with capacitances. Time-continuous RNNs were considered in a paper on back-propagation [<xref ref-type="bibr" rid="ref-215">215</xref>]. The temporal change of an RNN&#x2019;s state is related to the current state <inline-formula id="ieqn-1412"><mml:math id="mml-ieqn-1412"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula> and the input <inline-formula id="ieqn-1413"><mml:math id="mml-ieqn-1413"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> by</p>
<p><disp-formula id="eqn-270"><label>(270)</label><mml:math id="mml-eqn-270" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>+</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1414"><mml:math id="mml-ieqn-1414"><mml:mi>a</mml:mi></mml:math></inline-formula> denotes a non-linear activation function such as the sigmoid function, and <inline-formula id="ieqn-1415"><mml:math id="mml-ieqn-1415"><mml:mi mathvariant="bold-italic">W</mml:mi></mml:math></inline-formula> is the weight matrix that describes the connections among neurons.<xref ref-type="fn" rid="fn210"><sup>210</sup></xref><fn id="fn210"><label>210</label><p>See the general time-continuous neural network with a continuous delay described by Eq. (<xref ref-type="disp-formula" rid="eqn-514">514</xref>).</p></fn></p>
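<p>To make the time-continuous RNN of Eq. (<xref ref-type="disp-formula" rid="eqn-270">270</xref>) concrete, the sketch below evaluates its right-hand side with a sigmoid activation and integrates it in time. The explicit Euler integrator, the zero weight matrix, the zero input, and all function names are assumptions made for illustration only; they are not taken from the cited works.</p>
<code language="python">
```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def rnn_rate(y, x, W, a=sigmoid):
    """Right-hand side of Eq. (270): y_dot = -y + a(W y) + x."""
    return -y + a(W @ y) + x


def integrate_explicit_euler(y0, x_of_t, W, dt, n_steps):
    """Explicit-Euler time stepping; this integrator is an assumption of the sketch."""
    y = np.array(y0, dtype=float)
    for n in range(n_steps):
        y = y + dt * rnn_rate(y, x_of_t(n * dt), W)
    return y


W = np.zeros((2, 2))                      # hypothetical: no coupling among neurons
x_const = lambda t: np.zeros(2)           # hypothetical: zero input
y_end = integrate_explicit_euler(np.zeros(2), x_const, W, dt=0.01, n_steps=2000)
print(y_end)
```
</code>
<p>In this decoupled case, the rate reduces to the negative state plus the sigmoid evaluated at zero, so each component of the state relaxes toward 0.5.</p>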
<p>Returning to mechanics, we are confronted with the problem that the equations of motion do not admit closed-form solutions in general. To construct approximate solutions, we need to resort to time-integration schemes, of which we mention a few examples such as Newmark&#x2019;s method [<xref ref-type="bibr" rid="ref-216">216</xref>], the Hilber-Hughes-Taylor (HHT) method [<xref ref-type="bibr" rid="ref-217">217</xref>], and the generalized-<inline-formula id="ieqn-1416"><mml:math id="mml-ieqn-1416"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> method [<xref ref-type="bibr" rid="ref-218">218</xref>]. For simplicity, we take a closer look at the classical trapezoidal rule, in which the time integral of some function <inline-formula id="ieqn-1417"><mml:math id="mml-ieqn-1417"><mml:mi>f</mml:mi></mml:math></inline-formula> over a time step <inline-formula id="ieqn-1418"><mml:math id="mml-ieqn-1418"><mml:mo>&#x0394;</mml:mo><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:math></inline-formula> is approximated by the step size times the average of the function values <inline-formula id="ieqn-1419"><mml:math id="mml-ieqn-1419"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-1420"><mml:math id="mml-ieqn-1420"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. If we apply the trapezoidal rule to the system of ODEs in Eq. (<xref ref-type="disp-formula" rid="eqn-269">269</xref>), we obtain a system of algebraic relations that reads</p>
<p><disp-formula id="eqn-271"><label>(271)</label><mml:math id="mml-eqn-271" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">A</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">A</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Rearranging the above relation for the new state gives the update equation for the state vector,</p>
<p><disp-formula id="eqn-272"><label>(272)</label><mml:math id="mml-eqn-272" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo 
stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which determines the next state <inline-formula id="ieqn-1421"><mml:math id="mml-ieqn-1421"><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in terms of the state <inline-formula id="ieqn-1422"><mml:math id="mml-ieqn-1422"><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> as well as the inputs <inline-formula id="ieqn-1423"><mml:math id="mml-ieqn-1423"><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1424"><mml:math id="mml-ieqn-1424"><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>.<xref ref-type="fn" rid="fn211"><sup>211</sup></xref><fn id="fn211"><label>211</label><p>We can regard the trapezoidal rule as a combination of Euler&#x2019;s explicit and implicit methods. The <italic>explicit Euler</italic> method approximates time-integrals by means of rates (and inputs) at the beginning of a time step. 
The next state (at the end of a time step) is obtained from the previous state and the previous input as <inline-formula id="ieqn-3268"><mml:math id="mml-ieqn-3268"><mml:msub><mml:mi mathvariant='bold-italic'>q</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> <inline-formula id="ieqn-3269"><mml:math id="mml-ieqn-3269"><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:math></inline-formula> In contrast, the <italic>implicit Euler</italic> method uses rates (and inputs) at the end of a time step, which leads to the update relation <inline-formula id="ieqn-3270"><mml:math id="mml-ieqn-3270"><mml:msub><mml:mi mathvariant='bold-italic'>q</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> <inline-formula id="ieqn-3271"><mml:math id="mml-ieqn-3271"><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></inline-formula></p></fn> To keep it short, we introduce the matrices <inline-formula id="ieqn-1425"><mml:math id="mml-ieqn-1425"><mml:mi mathvariant="bold-italic">W</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-1426"><mml:math id="mml-ieqn-1426"><mml:mi mathvariant="bold-italic">U</mml:mi></mml:math></inline-formula> as</p>
<p><disp-formula id="eqn-273"><label>(273)</label><mml:math id="mml-eqn-273" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which allows us to rewrite Eq. (<xref ref-type="disp-formula" rid="eqn-272">272</xref>) as</p>
<p><disp-formula id="eqn-274"><label>(274)</label><mml:math id="mml-eqn-274" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">W</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
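<p>As a hedged illustration of Eqs. (<xref ref-type="disp-formula" rid="eqn-273">273</xref>) and (<xref ref-type="disp-formula" rid="eqn-274">274</xref>), the following minimal NumPy sketch assumes that the underlying linear system has the form dq/dt = A q + b, which is inferred here from the structure of the two matrices rather than quoted from the text. The function names <monospace>trapezoidal_matrices</monospace> and <monospace>integrate</monospace>, the oscillator matrix, and the step size are hypothetical choices made only for this example.</p>
<code language="python">
```python
import numpy as np

def trapezoidal_matrices(A, dt):
    """Return W and U of Eq. (273), assuming the system dq/dt = A q + b."""
    I = np.eye(A.shape[0])
    lhs = I - 0.5 * dt * A
    # Solve linear systems instead of forming the inverse explicitly.
    W = np.linalg.solve(lhs, I + 0.5 * dt * A)
    U = 0.5 * dt * np.linalg.solve(lhs, I)
    return W, U

def integrate(A, b_seq, q0, dt):
    """Apply the update of Eq. (274) along the forcing sequence b_seq[0..N]."""
    W, U = trapezoidal_matrices(A, dt)
    q = [np.asarray(q0, dtype=float)]
    for n in range(len(b_seq) - 1):
        q.append(W @ q[-1] + U @ (b_seq[n] + b_seq[n + 1]))
    return np.stack(q)

# Hypothetical example: undamped oscillator with unit frequency, no forcing.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
dt = 0.01
steps = 100
b_seq = np.zeros((steps + 1, 2))
q = integrate(A, b_seq, q0=[1.0, 0.0], dt=dt)
```
</code>
<p>For this particular skew-symmetric choice of A, the matrix W is orthogonal, so the norm of the state is preserved up to round-off, and the final state approximates the exact solution (cos 1, -sin 1) of the assumed system at time 1.</p>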
<p>The update equation of time-discrete RNNs is similar to the discretized equations of motion in Eq. (<xref ref-type="disp-formula" rid="eqn-274">274</xref>). Unlike in feed-forward neural networks, the state <inline-formula id="ieqn-1427"><mml:math id="mml-ieqn-1427"><mml:mi>h</mml:mi></mml:math></inline-formula> of an RNN at the <inline-formula id="ieqn-1428"><mml:math id="mml-ieqn-1428"><mml:mi>n</mml:mi></mml:math></inline-formula>-th time step,<xref ref-type="fn" rid="fn212"><sup>212</sup></xref><fn id="fn212"><label>212</label><p>Though the elements <inline-formula id="ieqn-3272"><mml:math id="mml-ieqn-3272"><mml:msup><mml:mi mathvariant='bold-italic'>x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of a sequence are commonly referred to as &#x201C;time steps&#x201D;, the nature of a sequence is not necessarily temporal. The time step index <inline-formula id="ieqn-3273"><mml:math id="mml-ieqn-3273"><mml:mi>n</mml:mi></mml:math></inline-formula> then merely refers to a position within some given sequence.</p></fn> which is denoted by <inline-formula id="ieqn-1429"><mml:math id="mml-ieqn-1429"><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, depends not only on the current input <inline-formula id="ieqn-1430"><mml:math id="mml-ieqn-1430"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, but also on the state <inline-formula id="ieqn-1431"><mml:math id="mml-ieqn-1431"><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of the previous time step <inline-formula id="ieqn-1432"><mml:math id="mml-ieqn-1432"><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. Following the notation in [<xref ref-type="bibr" rid="ref-78">78</xref>], we introduce a <italic>transition function</italic> <inline-formula id="ieqn-1433"><mml:math id="mml-ieqn-1433"><mml:mi>f</mml:mi></mml:math></inline-formula> that produces the new state,</p>
<p><disp-formula id="eqn-275"><label>(275)</label><mml:math id="mml-eqn-275" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<statement id="st7_1"><title>Remark 7.1.</title>
<p> In [<xref ref-type="bibr" rid="ref-78">78</xref>], there is a distinction between the &#x201C;hidden state&#x201D; of an RNN cell at the <inline-formula id="ieqn-1434"><mml:math id="mml-ieqn-1434"><mml:mi>n</mml:mi></mml:math></inline-formula>-th step denoted by <inline-formula id="ieqn-1435"><mml:math id="mml-ieqn-1435"><mml:msup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, where &#x201C;<inline-formula id="ieqn-1436"><mml:math id="mml-ieqn-1436"><mml:mi>h</mml:mi></mml:math></inline-formula>&#x201D; is mnemonic for &#x201C;hidden&#x201D;, and the cell&#x2019;s output <inline-formula id="ieqn-1437"><mml:math id="mml-ieqn-1437"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. The output of a multi-layer RNN is a linear combination of the last layer&#x2019;s hidden state. Depending on the application, the output is not necessarily computed at every time step, but the network can &#x201C;summarize&#x201D; sequences of inputs to produce an output after a certain number of steps.<xref ref-type="fn" rid="fn213"><sup>213</sup></xref><fn id="fn213"><label>213</label><p>cf. [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 371, Figure 10.5.</p></fn> If the output is identical to the hidden state,</p>
<p><disp-formula id="eqn-276"><label>(276)</label><mml:math id="mml-eqn-276" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-1438"><mml:math id="mml-ieqn-1438"><mml:msup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1439"><mml:math id="mml-ieqn-1439"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> can be used interchangeably. In the current Section <xref ref-type="sec" rid="s7">7</xref> on &#x201C;Dynamics, sequential data, sequence modeling&#x201D;, the notation <inline-formula id="ieqn-1440"><mml:math id="mml-ieqn-1440"><mml:msup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is used, whereas in Section <xref ref-type="sec" rid="s4">4</xref> on &#x201C;Static, feedforward networks&#x201D;, the notation <inline-formula id="ieqn-1441"><mml:math id="mml-ieqn-1441"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is used to designate the output of the &#x201C;hidden layer&#x201D; <inline-formula id="ieqn-1442"><mml:math id="mml-ieqn-1442"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, keeping in mind the equivalence in Eq. (<xref ref-type="disp-formula" rid="eqn-22">22</xref>) in Remark <xref ref-type="statement" rid="st4_2">4.2</xref>. Whenever necessary, readers are reminded of the equivalence in Eq. 
(<xref ref-type="disp-formula" rid="eqn-276">276</xref>) to avoid possible confusion when reading deep-learning literature.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>The recurrence relation in Eq. (<xref ref-type="disp-formula" rid="eqn-275">275</xref>) is illustrated as a cyclic graph in Figure <xref ref-type="fig" rid="fig-79">79</xref> (left), where the delay is explicit in the superscript of <inline-formula id="ieqn-1443"><mml:math id="mml-ieqn-1443"><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. The hidden state <inline-formula id="ieqn-1444"><mml:math id="mml-ieqn-1444"><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> at step <inline-formula id="ieqn-1445"><mml:math id="mml-ieqn-1445"><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, in turn, is a function of the hidden state <inline-formula id="ieqn-1446"><mml:math id="mml-ieqn-1446"><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the input <inline-formula id="ieqn-1447"><mml:math id="mml-ieqn-1447"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, so that</p>
<p><disp-formula id="eqn-277"><label>(277)</label><mml:math id="mml-eqn-277" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Continuing this <italic>unfolding</italic> process repeatedly until we reach the beginning of a sequence, the recurrence can be expressed as a function <inline-formula id="ieqn-1448"><mml:math id="mml-ieqn-1448"><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-278"><label>(278)</label><mml:math id="mml-eqn-278" display="block"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mi>n</mml:mi><mml:mtext>&#x2009;</mml:mtext></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mi>n</mml:mi> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mi>n</mml:mi> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x200B;&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mn>2</mml:mn> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mn>1</mml:mn> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mn>0</mml:mn> 
<mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mi>n</mml:mi> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mi>k</mml:mi> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow> <mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mn>0</mml:mn> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>which takes the entire sequence up to the current step <inline-formula id="ieqn-1449"><mml:math id="mml-ieqn-1449"><mml:mi>n</mml:mi></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-1450"><mml:math id="mml-ieqn-1450"><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mi>k</mml:mi> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> (along with an initial state <inline-formula id="ieqn-1451"><mml:math id="mml-ieqn-1451"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and parameters <inline-formula id="ieqn-1452"><mml:math id="mml-ieqn-1452"><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi></mml:math></inline-formula>), as input to compute the current state <inline-formula id="ieqn-1453"><mml:math id="mml-ieqn-1453"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. The unfolded graph representation is illustrated in Figure <xref ref-type="fig" rid="fig-79">79</xref> (right).</p>
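<p>The unfolding in Eqs. (<xref ref-type="disp-formula" rid="eqn-277">277</xref>) and (<xref ref-type="disp-formula" rid="eqn-278">278</xref>) can be sketched in a few lines of Python. The scalar transition function below is a hypothetical stand-in for the general transition function in Eq. (<xref ref-type="disp-formula" rid="eqn-275">275</xref>), and the parameter and input values are arbitrary; the only point is that folding the same function over the sequence reproduces the hand-written nesting.</p>
<code language="python">
```python
import math
from functools import reduce

def f(h_prev, x, theta):
    """Hypothetical scalar transition function standing in for f in Eq. (275)."""
    w, u = theta
    return math.tanh(w * h_prev + u * x)

def g(xs, h0, theta):
    """Unfolded recurrence of Eq. (278): maps the whole input sequence to h[n]."""
    return reduce(lambda h, x: f(h, x, theta), xs, h0)

theta = (0.5, 1.0)       # shared parameters, reused at every step
xs = [0.1, -0.2, 0.3]    # inputs x[1], x[2], x[3]
h0 = 0.0                 # initial state h[0]

# Writing out the nesting by hand, as in Eq. (277), gives the same state.
h3_nested = f(f(f(h0, xs[0], theta), xs[1], theta), xs[2], theta)
h3_unfolded = g(xs, h0, theta)
```
</code>
<p>Because the same parameters are passed to every call, the sketch also makes the parameter sharing across steps explicit.</p>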
<fig id="fig-79">
<label>Figure 79</label>
<caption><title><italic>Folded and unfolded discrete RNN</italic> (Section <xref ref-type="sec" rid="s7_1">7.1</xref>, <xref ref-type="sec" rid="s13_2_2">13.2.2</xref>). <italic>Left:</italic> <italic>Folded discrete RNN</italic> at configuration (or state) number <inline-formula id="ieqn-712"><mml:math id="mml-ieqn-712"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-713"><mml:math id="mml-ieqn-713"><mml:mi>k</mml:mi></mml:math></inline-formula> is an integer, with input <inline-formula id="ieqn-714"><mml:math id="mml-ieqn-714"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> to a multilayer neural network <inline-formula id="ieqn-715"><mml:math id="mml-ieqn-715"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2218;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2218;</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>&#x2218;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as in Eq. 
(<xref ref-type="disp-formula" rid="eqn-18">18</xref>), having a feedback loop <inline-formula id="ieqn-716"><mml:math id="mml-ieqn-716"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> with delay by one step, to produce output <inline-formula id="ieqn-717"><mml:math id="mml-ieqn-717"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. <italic>Right:</italic> <italic>Unfolded discrete RNN</italic>, where the feedback loop is unfolded, centered at <inline-formula id="ieqn-718"><mml:math id="mml-ieqn-718"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, as represented by Eq. (<xref ref-type="disp-formula" rid="eqn-275">275</xref>). This graphical representation, with <inline-formula id="ieqn-719"><mml:math id="mml-ieqn-719"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> being a multilayer neural network, is more general than Figure 10.2 in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 366, and is a particular case of Figure 10.13b in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 388. See also the continuous RNN explained in Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref> on &#x201C;Dynamic, time dependence, Volterra series&#x201D;, Eq. 
(<xref ref-type="disp-formula" rid="eqn-514">514</xref>), Figure <xref ref-type="fig" rid="fig-135">135</xref>, for which we refer readers to Remark <xref ref-type="statement" rid="st7_1">7.1</xref> and the notation equivalence <inline-formula id="ieqn-720"><mml:math id="mml-ieqn-720"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-276">276</xref>).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-79.tif"/>
</fig>
<p>As an example, consider the default <italic>(&#x201C;vanilla&#x201D;)</italic> single-layer RNN provided by PyTorch<xref ref-type="fn" rid="fn214"><sup>214</sup></xref><fn id="fn214"><label>214</label><p>See PyTorch documentation: Recurrent layers, <ext-link ext-link-type="uri" xlink:href="https://pytorch.org/docs/stable/nn.html#recurrent-layers">Original website</ext-link> (<ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20221113090517/https://pytorch.org/docs/stable/nn.html#recurrent-layers">Internet archive</ext-link>)</p></fn> and TensorFlow<xref ref-type="fn" rid="fn215"><sup>215</sup></xref><fn id="fn215"><label>215</label><p>See TensorFlow API: TensorFlow Core r1.14: tf.keras.layers.SimpleRNN, <ext-link ext-link-type="uri" xlink:href="https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN">Original website</ext-link> (<ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220405022833/https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN">Internet archive</ext-link>)</p></fn>, which is also described in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 370:</p>
<p><disp-formula id="eqn-279"><label>(279)</label><mml:math id="mml-eqn-279" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">W</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">U</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo 
stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>First, <inline-formula id="ieqn-1454"><mml:math id="mml-ieqn-1454"><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is formed as an affine function of the current input <inline-formula id="ieqn-1455"><mml:math id="mml-ieqn-1455"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the previous hidden state <inline-formula id="ieqn-1456"><mml:math id="mml-ieqn-1456"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, offset by the bias <inline-formula id="ieqn-1457"><mml:math id="mml-ieqn-1457"><mml:mi mathvariant="bold-italic">b</mml:mi></mml:math></inline-formula>, where the weight matrices <inline-formula id="ieqn-1458"><mml:math id="mml-ieqn-1458"><mml:mi mathvariant="bold-italic">U</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-1459"><mml:math id="mml-ieqn-1459"><mml:mi mathvariant="bold-italic">W</mml:mi></mml:math></inline-formula> multiply the input and the previous hidden state, respectively. 
Subsequently, the hyperbolic tangent is applied element-wise to <inline-formula id="ieqn-1460"><mml:math id="mml-ieqn-1460"><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> as the activation function, which produces the new hidden state <inline-formula id="ieqn-1461"><mml:math id="mml-ieqn-1461"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
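<p>A minimal NumPy sketch of the single-layer cell in Eq. (<xref ref-type="disp-formula" rid="eqn-279">279</xref>) is given next. It is an illustration under stated assumptions, not the implementation used by either library: the function names <monospace>rnn_step</monospace> and <monospace>rnn_forward</monospace>, the layer sizes, and the random parameter values are hypothetical.</p>
<code language="python">
```python
import numpy as np

def rnn_step(h_prev, x, W, U, b):
    """One step of Eq. (279): affine map followed by element-wise tanh."""
    z = b + W @ h_prev + U @ x
    return np.tanh(z)

def rnn_forward(xs, h0, W, U, b):
    """Run the cell over a sequence xs of shape (steps, input_size)."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(h, x, W, U, b)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes and random parameters, chosen only for illustration.
rng = np.random.default_rng(0)
hidden_size, input_size, steps = 4, 3, 5
W = 0.1 * rng.standard_normal((hidden_size, hidden_size))
U = 0.1 * rng.standard_normal((hidden_size, input_size))
b = np.zeros(hidden_size)
xs = rng.standard_normal((steps, input_size))
states = rnn_forward(xs, np.zeros(hidden_size), W, U, b)
```
</code>
<p>In PyTorch&#x2019;s <monospace>torch.nn.RNN</monospace>, the roles of the input and recurrent weight matrices are played by the parameters <monospace>weight_ih_l0</monospace> and <monospace>weight_hh_l0</monospace>, and the bias is split into <monospace>bias_ih_l0</monospace> and <monospace>bias_hh_l0</monospace>, whose sum corresponds to the single bias used in the sketch.</p>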
<p>A common design pattern of RNNs adds a linear output layer to the simple example in Figure <xref ref-type="fig" rid="fig-79">79</xref>, i.e., the RNN has a recurrent connection between its hidden units, which represent the state <inline-formula id="ieqn-1462"><mml:math id="mml-ieqn-1462"><mml:mi mathvariant="bold-italic">h</mml:mi></mml:math></inline-formula>,<xref ref-type="fn" rid="fn216"><sup>216</sup></xref><fn id="fn216"><label>216</label><p>For this reason, <inline-formula id="ieqn-3274"><mml:math id="mml-ieqn-3274"><mml:mi mathvariant='bold-italic'>h</mml:mi></mml:math></inline-formula> is typically referred to as the <italic>hidden state</italic> of an RNN.</p></fn> and produces an output at each time step.<xref ref-type="fn" rid="fn217"><sup>217</sup></xref><fn id="fn217"><label>217</label><p>Such a neural network is <italic>universal</italic>, i.e., any function computable by a <italic>Turing machine</italic> can be computed by such an RNN of finite size; see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 368.</p></fn> Figure <xref ref-type="fig" rid="fig-80">80</xref> shows a two-layer RNN, which extends the above example by a second layer, i.e., the first layer is identical to Eq. (<xref ref-type="disp-formula" rid="eqn-279">279</xref>),</p>
<p><disp-formula id="eqn-280"><label>(280)</label><mml:math id="mml-eqn-280" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">W</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">U</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo 
stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and the second layer applies an affine transformation to the first layer&#x2019;s output <inline-formula id="ieqn-1463"><mml:math id="mml-ieqn-1463"><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, adding the bias <inline-formula id="ieqn-1464"><mml:math id="mml-ieqn-1464"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula> to its product with the weight matrix <inline-formula id="ieqn-1465"><mml:math id="mml-ieqn-1465"><mml:mi mathvariant="bold-italic">V</mml:mi></mml:math></inline-formula>. Assuming the output <inline-formula id="ieqn-1466"><mml:math id="mml-ieqn-1466"><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> is meant to represent probabilities, it is obtained by applying a <inline-formula id="ieqn-1467"><mml:math id="mml-ieqn-1467"><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> activation to the result of this affine transformation (a derivation of the softmax function is provided in Remark <xref ref-type="statement" rid="st5_3">5.3</xref> in Section <xref ref-type="sec" rid="s5_1_3">5.1.3</xref>):</p>
<p><disp-formula id="eqn-281"><label>(281)</label><mml:math id="mml-eqn-281" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">V</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which was given in Eq. (<xref ref-type="disp-formula" rid="eqn-84">84</xref>); see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 369. The parameters of the network are the weight matrices <inline-formula id="ieqn-1468"><mml:math id="mml-ieqn-1468"><mml:mi mathvariant="bold-italic">W</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-1469"><mml:math id="mml-ieqn-1469"><mml:mi mathvariant="bold-italic">U</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-1470"><mml:math id="mml-ieqn-1470"><mml:mi mathvariant="bold-italic">V</mml:mi></mml:math></inline-formula>, as well as the biases <inline-formula id="ieqn-1471"><mml:math id="mml-ieqn-1471"><mml:mi mathvariant="bold-italic">b</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-1472"><mml:math id="mml-ieqn-1472"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula> of the recurrence layer and the output layer, respectively.</p>
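<p>The forward pass defined by Eqs. (280) and (281) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular library: the helper names (<monospace>matvec</monospace>, <monospace>add</monospace>, <monospace>rnn_step</monospace>, <monospace>rnn_forward</monospace>), the layer sizes, and all numerical values of the parameters are hypothetical and chosen only for demonstration.</p>
<code language="python" xml:space="preserve">
```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]

def softmax(z):
    """Softmax with the usual shift by max(z) for numerical stability."""
    shift = max(z)
    exps = [math.exp(z_i - shift) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(h1_prev, x, W, U, V, b, c):
    """One time step of the two-layer RNN, Eqs. (280)-(281)."""
    z1 = add(b, matvec(W, h1_prev), matvec(U, x))  # z1[n] = b + W h1[n-1] + U x[n]
    h1 = [math.tanh(z) for z in z1]                # h1[n] = tanh(z1[n])
    z2 = add(c, matvec(V, h1))                     # z2[n] = c + V h1[n]
    h2 = softmax(z2)                               # h2[n] = softmax(z2[n])
    return h1, h2

def rnn_forward(xs, h1_init, W, U, V, b, c):
    """Apply rnn_step to every element of the input sequence xs."""
    h1 = h1_init
    outputs = []
    for x in xs:
        h1, h2 = rnn_step(h1, x, W, U, V, b, c)
        outputs.append(h2)
    return h1, outputs

# Hypothetical parameters: 2 hidden units, 1 input feature, 3 output classes.
W = [[0.5, -0.1], [0.2, 0.3]]
U = [[1.0], [-1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
b = [0.0, 0.1]
c = [0.0, 0.0, 0.0]

xs = [[0.5], [-0.2], [1.0]]
h1_final, probs = rnn_forward(xs, [0.0, 0.0], W, U, V, b, c)
```
</code>
<p>In this sketch, each entry of <monospace>probs</monospace> is a vector of non-negative numbers that sum to one, and the same five parameter arrays are reused at every time step.</p>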
<fig id="fig-80">
<label>Figure 80</label>
<caption><title><italic>RNN with two multilayer neural networks (MLNs),</italic> (Section <xref ref-type="sec" rid="s7_1">7.1</xref>) denoted by <inline-formula id="ieqn-721"><mml:math id="mml-ieqn-721"><mml:msub><mml:mi>f</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-722"><mml:math id="mml-ieqn-722"><mml:msub><mml:mi>f</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, whose outputs are fed into the loss function for optimization. This RNN is a generalization of the RNN in Figure <xref ref-type="fig" rid="fig-79">79</xref>, and includes the RNN in Figure 10.3 in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 369, as a particular case, where Eq. (<xref ref-type="disp-formula" rid="eqn-282">282</xref>) of the first layer is simply <inline-formula id="ieqn-723"><mml:math id="mml-ieqn-723"><mml:msubsup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula 
id="ieqn-724"><mml:math id="mml-ieqn-724"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">W</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">U</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-725"><mml:math id="mml-ieqn-725"><mml:msub><mml:mi>a</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, whereas Eq. 
(<xref ref-type="disp-formula" rid="eqn-282">282</xref>) is <inline-formula id="ieqn-726"><mml:math id="mml-ieqn-726"><mml:msubsup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-727"><mml:math id="mml-ieqn-727"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">V</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-728"><mml:math id="mml-ieqn-728"><mml:msub><mml:mi>a</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo 
stretchy="false">)</mml:mo></mml:math></inline-formula> for the second layer. The general relation for both MLNs is <inline-formula id="ieqn-729"><mml:math id="mml-ieqn-729"><mml:msubsup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, for <inline-formula id="ieqn-730"><mml:math id="mml-ieqn-730"><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-80.tif"/>
</fig>
<p>Irrespective of the number of layers, the hidden state <inline-formula id="ieqn-1473"><mml:math id="mml-ieqn-1473"><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> of the <inline-formula id="ieqn-1474"><mml:math id="mml-ieqn-1474"><mml:mi>j</mml:mi></mml:math></inline-formula>-th layer is generally computed from the previous hidden state <inline-formula id="ieqn-1475"><mml:math id="mml-ieqn-1475"><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and the input to the layer <inline-formula id="ieqn-1476"><mml:math id="mml-ieqn-1476"><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>,</p>
<p>
<disp-formula id="eqn-282"><label>(282)</label><mml:math id="mml-eqn-282" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Other design patterns for RNNs have, e.g., recurrent connections between the hidden units but produce only a single output. RNNs may also have recurrent connections from the output at one time step to the hidden units of the next time step.</p>
<p>Comparing the recurrence relation in Eq. (<xref ref-type="disp-formula" rid="eqn-277">277</xref>) and its unfolded representation in Eq. (<xref ref-type="disp-formula" rid="eqn-278">278</xref>), we can make the following observations:</p>
<list list-type="bullet">
<list-item><p>The unfolded representation after <inline-formula id="ieqn-1477"><mml:math id="mml-ieqn-1477"><mml:mi>n</mml:mi></mml:math></inline-formula> steps <inline-formula id="ieqn-1478"><mml:math id="mml-ieqn-1478"><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> can be regarded as a factorization into repeated applications of <inline-formula id="ieqn-1479"><mml:math id="mml-ieqn-1479"><mml:mi>f</mml:mi></mml:math></inline-formula>. Unlike <inline-formula id="ieqn-1480"><mml:math id="mml-ieqn-1480"><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, the transition function <inline-formula id="ieqn-1481"><mml:math id="mml-ieqn-1481"><mml:mi>f</mml:mi></mml:math></inline-formula> does not depend on the length of the sequence and always has the same input size.</p></list-item>
<list-item><p>The same transition function <inline-formula id="ieqn-1482"><mml:math id="mml-ieqn-1482"><mml:mi>f</mml:mi></mml:math></inline-formula> with the same parameters <inline-formula id="ieqn-1483"><mml:math id="mml-ieqn-1483"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> is used in every time step.</p></list-item>
<list-item><p>A state <inline-formula id="ieqn-1484"><mml:math id="mml-ieqn-1484"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> contains information about the whole past sequence.</p></list-item></list> 
<statement id="st7_2"><title>Remark 7.2.</title>
<p><italic>Depth of RNNs</italic>. For the above reasons and Figures 79-80, &#x201C;RNNs, once unfolded in time, can be seen as very deep feedforward networks in which all the layers share the same weights&#x201D; [<xref ref-type="bibr" rid="ref-13">13</xref>]. See Section <xref ref-type="sec" rid="s4_6_1">4.6.1</xref> on network depth and Remark <xref ref-type="statement" rid="st4_5">4.5</xref>. &#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>By nature, RNNs are typically employed for the processing of sequential data <inline-formula id="ieqn-1486"><mml:math id="mml-ieqn-1486"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, where the sequence length <inline-formula id="ieqn-1485"><mml:math id="mml-ieqn-1485"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> need not be constant. To process data of variable length, parameter sharing is a fundamental concept that characterizes RNNs. Instead of using separate parameters for each time step in a sequence, the same parameters are shared across time steps. Parameter sharing not only allows us to process sequences of variable length (possibly of lengths not seen during training); the <italic>&#x201C;statistical strength&#x201D;</italic><xref ref-type="fn" rid="fn218"><sup>218</sup></xref><fn id="fn218"><label>218</label><p>see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 363.</p></fn> is also shared across different positions in time, which is important if relevant information occurs at different positions within a sequence. A fully-connected feedforward neural network that takes each element of a sequence as input, by contrast, needs to learn all its rules separately for each position in the sequence.</p>
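<p>Parameter sharing and the unfolded representation can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names (<monospace>make_transition</monospace>, <monospace>unfold</monospace>, <monospace>rnn_parameter_count</monospace>, <monospace>dense_parameter_count</monospace>) and all numerical values are hypothetical, and the fully-connected alternative is assumed to be a single dense layer acting on the concatenated sequence. The same transition function with the same parameters is folded over sequences of different lengths, whereas the parameter count of the dense layer grows with the sequence length.</p>
<code language="python" xml:space="preserve">
```python
import math
from functools import reduce

def make_transition(W, U, b):
    """Return f(h_prev, x) = tanh(b + W h_prev + U x) with fixed parameters."""
    def f(h_prev, x):
        return [
            math.tanh(b_i
                      + sum(w * h for w, h in zip(W_row, h_prev))
                      + sum(u * x_k for u, x_k in zip(U_row, x)))
            for W_row, U_row, b_i in zip(W, U, b)
        ]
    return f

def unfold(f, h_init, xs):
    """Unfolded map: apply the same f once per element, h[n] = f(h[n-1], x[n])."""
    return reduce(f, xs, h_init)

def rnn_parameter_count(m, d):
    """W is m x m, U is m x d, b has m entries: independent of the length tau."""
    return m * m + m * d + m

def dense_parameter_count(m, d, tau):
    """Dense layer on the concatenated sequence: m x (tau * d) weights plus m biases."""
    return m * tau * d + m

# Hypothetical parameters: 2 hidden units, 1 input feature.
W = [[0.5, -0.1], [0.2, 0.3]]
U = [[1.0], [-1.0]]
b = [0.0, 0.1]
f = make_transition(W, U, b)

short_seq = [[0.5], [-0.2], [1.0]]                    # tau = 3
long_seq = short_seq + [[0.3], [0.0], [-0.7], [0.9]]  # tau = 7

h_short = unfold(f, [0.0, 0.0], short_seq)
h_long = unfold(f, [0.0, 0.0], long_seq)
```
</code>
<p>Both calls to <monospace>unfold</monospace> return a state of the same size, and continuing from <monospace>h_short</monospace> with the remaining four inputs reproduces <monospace>h_long</monospace>, since the state summarizes the past sequence.</p>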
<p>Comparing the update equations Eq. (<xref ref-type="disp-formula" rid="eqn-274">274</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-279">279</xref>), we note the close resemblance of dynamic systems and RNNs. Leaving aside the non-linearity of the activation function and the presence of the bias vector, both have state vectors with recurrent connections to the previous states. Employing the trapezoidal rule for the time-discretization, we find a recurrence in the input, which is not present in the type of RNNs described above. The concept of parameter sharing in RNNs translates into the notion of time-invariant systems in dynamics, i.e., the state matrix <inline-formula id="ieqn-1487"><mml:math id="mml-ieqn-1487"><mml:mi mathvariant="bold-italic">A</mml:mi></mml:math></inline-formula> does not depend on time. In the computational mechanics context, typical outputs of a simulation could be, e.g., the displacement of a structure at some specified point or the von Mises stress field in its interior. The computations required for determining the output from the state (e.g., nodal displacements of a finite-element model) depend on the respective nature of the output quantity and need not be linear.</p>
<p>The crucial challenge in RNNs is to learn long-term dependencies, i.e., relations among distant elements in input sequences. For long sequences, we face the problem of vanishing or exploding gradients when training the network by means of back-propagation. To understand vanishing (or exploding) gradients, we can draw analogies between RNNs and dynamic systems once again. For this purpose, we consider an RNN without inputs whose activation function is the identity function:</p>
<p><disp-formula id="eqn-283"><label>(283)</label><mml:math id="mml-eqn-283" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">W</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>From the dynamics point of view, the above update equation corresponds to a linear autonomous system, whose time-discrete representation is given by</p>
<p><disp-formula id="eqn-284"><label>(284)</label><mml:math id="mml-eqn-284" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">W</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Clearly, the equilibrium state of the above system is the trivial state <inline-formula id="ieqn-1488"><mml:math id="mml-ieqn-1488"><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn mathvariant="bold">0</mml:mn></mml:mrow></mml:math></inline-formula>. Let <inline-formula id="ieqn-1489"><mml:math id="mml-ieqn-1489"><mml:msub><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> denote a perturbation of the equilibrium state. An equilibrium state is called <italic>Lyapunov stable</italic> if trajectories of the system, i.e., the states at times <inline-formula id="ieqn-1490"><mml:math id="mml-ieqn-1490"><mml:mi>t</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, remain bounded. If the trajectory eventually arrives at the equilibrium state for <inline-formula id="ieqn-1491"><mml:math id="mml-ieqn-1491"><mml:mi>t</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula> (i.e., the trajectory is attractive), the equilibrium of the system is called <italic>asymptotically stable</italic>. In other words, an initial perturbation <inline-formula id="ieqn-1492"><mml:math id="mml-ieqn-1492"><mml:msub><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> from the equilibrium state (i.e., the initial state) vanishes over time in the case of asymptotic stability. Linear time-discrete systems are asymptotically stable if all eigenvalues <inline-formula id="ieqn-1493"><mml:math id="mml-ieqn-1493"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> of <inline-formula id="ieqn-1494"><mml:math id="mml-ieqn-1494"><mml:mi mathvariant="bold-italic">W</mml:mi></mml:math></inline-formula> have an absolute value smaller than one. From the unfolded representation in Eq. 
(<xref ref-type="disp-formula" rid="eqn-278">278</xref>) (see also Figure <xref ref-type="fig" rid="fig-79">79</xref> (right)), it is understood that we observe a similar behavior in the RNN described above. At step <inline-formula id="ieqn-1495"><mml:math id="mml-ieqn-1495"><mml:mi>n</mml:mi></mml:math></inline-formula>, the initial state <inline-formula id="ieqn-1496"><mml:math id="mml-ieqn-1496"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> has been multiplied <inline-formula id="ieqn-1497"><mml:math id="mml-ieqn-1497"><mml:mi>n</mml:mi></mml:math></inline-formula> times with the weight matrix <inline-formula id="ieqn-1498"><mml:math id="mml-ieqn-1498"><mml:mi mathvariant="bold-italic">W</mml:mi></mml:math></inline-formula>, i.e.,</p>
<p><disp-formula id="eqn-285"><label>(285)</label><mml:math id="mml-eqn-285" display="block"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>n</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>If all eigenvalues of <inline-formula id="ieqn-1499"><mml:math id="mml-ieqn-1499"><mml:mi mathvariant="bold-italic">W</mml:mi></mml:math></inline-formula> have absolute values smaller than one, the contribution of the initial state <inline-formula id="ieqn-1500"><mml:math id="mml-ieqn-1500"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> decays exponentially to zero in long sequences. On the other hand, we encounter exponentially increasing values if <inline-formula id="ieqn-1501"><mml:math id="mml-ieqn-1501"><mml:mi mathvariant="bold-italic">W</mml:mi></mml:math></inline-formula> has eigenvalues with magnitudes greater than one, which is equivalent to an unstable system in dynamics. When performing back-propagation to train RNNs, gradients of the loss function need to be passed backwards through the unfolded network, where gradients are repeatedly multiplied with <inline-formula id="ieqn-1502"><mml:math id="mml-ieqn-1502"><mml:mi mathvariant="bold-italic">W</mml:mi></mml:math></inline-formula> (or its transpose, which has the same eigenvalues), just as the initial state <inline-formula id="ieqn-1503"><mml:math id="mml-ieqn-1503"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is in the forward pass. The exponential decay (or increase) therefore causes gradients to vanish (or explode) in long sequences, which makes it difficult to learn long-term dependencies among distant elements of a sequence.</p>
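<p>The exponential decay or growth in Eq. (285) can be checked numerically. The following Python sketch assumes, purely for illustration, that the weight matrix is a scaled planar rotation, whose two complex-conjugate eigenvalues both have magnitude equal to the scaling factor; the function names and the values 0.9, 1.1, and 50 steps are hypothetical choices. Since a rotation preserves the Euclidean norm, the norm of the state is scaled by exactly this factor in every step.</p>
<code language="python" xml:space="preserve">
```python
import math

def scaled_rotation(rho, theta):
    """2x2 matrix rho * R(theta); its eigenvalues rho * exp(+-i theta) have magnitude rho."""
    return [[rho * math.cos(theta), -rho * math.sin(theta)],
            [rho * math.sin(theta), rho * math.cos(theta)]]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def norm(v):
    """Euclidean norm of vector v."""
    return math.sqrt(sum(v_i * v_i for v_i in v))

def state_norms(W, h0, steps):
    """Norms of h[n] = W h[n-1], Eq. (283), for n = 0, ..., steps."""
    h = list(h0)
    norms = [norm(h)]
    for _ in range(steps):
        h = matvec(W, h)
        norms.append(norm(h))
    return norms

h0 = [1.0, 0.0]
decaying = state_norms(scaled_rotation(0.9, 0.3), h0, 50)  # eigenvalue magnitude 0.9
growing = state_norms(scaled_rotation(1.1, 0.3), h0, 50)   # eigenvalue magnitude 1.1
```
</code>
<p>After 50 steps, the norm of the state equals, up to rounding, the 50th power of the eigenvalue magnitude, i.e., it has shrunk by more than two orders of magnitude in the first case and grown by about two orders of magnitude in the second.</p>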
<fig id="fig-81">
<label>Figure 81</label>
<caption><title><italic>Folded Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cell</italic> (Section <xref ref-type="sec" rid="s7_2">7.2</xref>, <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>). The cell state at <inline-formula id="ieqn-731"><mml:math id="mml-ieqn-731"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> is denoted by <inline-formula id="ieqn-732"><mml:math id="mml-ieqn-732"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. Two feedback loops, one for cell state <inline-formula id="ieqn-733"><mml:math id="mml-ieqn-733"><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi></mml:msub></mml:math></inline-formula> and one for hidden state <inline-formula id="ieqn-734"><mml:math id="mml-ieqn-734"><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:math></inline-formula>, with one-step delay <inline-formula id="ieqn-735"><mml:math id="mml-ieqn-735"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. 
The key unified recurring relation is <inline-formula id="ieqn-736"><mml:math id="mml-ieqn-736"><mml:msub><mml:mi>&#x2131;</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi mathvariant="bold-italic">&#x03B1;</mml:mi><mml:mrow><mml:mo mathvariant="bold" stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">k</mml:mi><mml:mo mathvariant="bold" stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-737"><mml:math id="mml-ieqn-737"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mtext>&#xA0;(state)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mtext>&#xA0;(forget)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x02110;</mml:mi></mml:mrow><mml:mrow><mml:mtext>&#xA0;(Input)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi>g</mml:mi><mml:mrow><mml:mtext>&#xA0;(external input)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mtext>&#xA0;(Output)</mml:mtext></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-738"><mml:math id="mml-ieqn-738"><mml:msub><mml:mi>A</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub></mml:math></inline-formula> is a sigmoidal activation (squashing) function, and <inline-formula id="ieqn-739"><mml:math id="mml-ieqn-739"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> is a linear 
combination of some inputs with weights plus biases at cell state <inline-formula id="ieqn-740"><mml:math id="mml-ieqn-740"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. See Figure <xref ref-type="fig" rid="fig-82">82</xref> for unfolded RNN with LSTM cells, and also Figure <xref ref-type="fig" rid="fig-15">15</xref> in Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-81.tif"/>
</fig>
</sec>
<sec id="s7_2"><label>7.2</label>
<title>Long Short-Term Memory (LSTM) unit</title>
<p>The vanishing (exploding) gradient problem prevents us from effectively learning long-term dependencies in long input sequences by means of conventional RNNs. Gated RNNs such as the long short-term memory (LSTM) and networks based on the gated recurrent unit (GRU) have proven to successfully overcome the vanishing gradient problem in diverse applications. The common idea of gated RNNs is to create paths through time along which gradients neither vanish nor explode. Gated RNNs can accumulate information in their state over many time steps, but, once the information has been used, they are capable of forgetting their state by, figuratively speaking, &#x201C;closing gates&#x201D; to stop the information flow. This concept bears a resemblance to residual networks, which introduce skip connections to circumvent vanishing gradients in deep feed-forward networks; see Section <xref ref-type="sec" rid="s4_6_2">4.6.2</xref> on network &#x201C;Architecture&#x201D;.</p> 
<statement id="st7_3"><title>Remark 7.3.</title>
<p><italic>What is &#x201C;short-term&#x201D; memory?</italic> Because of the vanishing gradient at the earlier states of an RNN (or layers in the case of a multilayer neural network), information in these earlier states (or layers) does not propagate forward to help adjust the predicted outputs to track the labeled outputs, and thus to decrease the loss. A state <inline-formula id="ieqn-1504"><mml:math id="mml-ieqn-1504"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> with two feedback loops is depicted in Figure <xref ref-type="fig" rid="fig-81">81</xref>. Information in the earlier states does not propagate forward because the weights in these layers change little (i.e., do not learn): the gradients are very small as a result of repeated multiplications of numbers with magnitude less than 1. Consequently, information in earlier states (or &#x201C;events&#x201D;) plays little role in decreasing the loss function, and thus has only &#x201C;short-term&#x201D; effects, rather than the needed long-term effect to be carried forward to the output layer. Hence the short-term memory problem. See also Remark <xref ref-type="statement" rid="st5_5">5.5</xref> in Section <xref ref-type="sec" rid="s5">5</xref> on back-propagation for vanishing or exploding gradient in multilayer neural networks.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>In their pioneering work on LSTM, the authors of [<xref ref-type="bibr" rid="ref-24">24</xref>] presented a mechanism that allows information (inputs, gradients) to flow over a long duration by introducing additional states, paths, and self-loops. The additional components are encapsulated in so-called <italic>LSTM cells</italic>. LSTM cells are the building blocks for LSTM networks, where they are connected recurrently to each other analogously to hidden neurons in conventional RNNs. The introduction of a <italic>cell state</italic><xref ref-type="fn" rid="fn219"><sup>219</sup></xref><fn id="fn219"><label>219</label><p>The cell state is denoted with the variable <inline-formula id="ieqn-3275"><mml:math id="mml-ieqn-3275"><mml:mi>s</mml:mi></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 399, Eq. (10.41).</p></fn> <inline-formula id="ieqn-1505"><mml:math id="mml-ieqn-1505"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula> is one of the key ingredients of LSTM. The schematic cell representation (Figure <xref ref-type="fig" rid="fig-81">81</xref>) shows that the cell state can propagate through an LSTM cell without much interference, which is why this path is described as a &#x201C;conveyor belt&#x201D; for information in [<xref ref-type="bibr" rid="ref-219">219</xref>].</p>
<p>Another way to elucidate the concept is as follows: Since a conventional RNN cannot store information for a long time, i.e., over many subsequent steps, the LSTM cell corrects this short-term memory problem by remembering the inputs for a long time:</p>
<disp-quote><p>&#x201C;A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.&#x201D; [<xref ref-type="bibr" rid="ref-13">13</xref>].</p>
</disp-quote><p>In a unified manner, the various relations in the original LSTM unit depicted in Figure <xref ref-type="fig" rid="fig-81">81</xref> can be expressed in a single key generic recurring-relation that is more easily remembered:</p>
<p><disp-formula id="eqn-286"><label>(286)</label><mml:math id="mml-eqn-286" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mtext>&#xA0;(state)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mtext>&#xA0;(forget)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x02110;</mml:mi></mml:mrow><mml:mrow><mml:mtext>&#xA0;(Input)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi>g</mml:mi><mml:mrow><mml:mtext>&#xA0;(external input)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mtext>&#xA0;(Output)</mml:mtext></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1506"><mml:math id="mml-ieqn-1506"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is the input at cell state <inline-formula id="ieqn-1507"><mml:math id="mml-ieqn-1507"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-1508"><mml:math id="mml-ieqn-1508"><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> the hidden variable at cell state <inline-formula id="ieqn-1509"><mml:math id="mml-ieqn-1509"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo mathvariant="bold-italic">&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-1510"><mml:math id="mml-ieqn-1510"><mml:msub><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub></mml:math></inline-formula> (with &#x201C;<inline-formula id="ieqn-1511"><mml:math id="mml-ieqn-1511"><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:math></inline-formula>&#x201D; being mnemonic for &#x201C;activation&#x201D;) is a sigmoidal activation (squashing) function&#x2013;which can be either the logistic sigmoid function or the hyperbolic tangent function (see Section <xref ref-type="sec" rid="s5_3_1">5.3.1</xref>)&#x2013;and <inline-formula id="ieqn-1512"><mml:math 
id="mml-ieqn-1512"><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> is a linear combination of some inputs with weights plus biases at cell state <inline-formula id="ieqn-1513"><mml:math id="mml-ieqn-1513"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. The choice of the notation in Eq. (<xref ref-type="disp-formula" rid="eqn-286">286</xref>) is to be consistent with the notation in the relation <inline-formula id="ieqn-1514"><mml:math id="mml-ieqn-1514"><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>a</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>a</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> in the caption of Figure <xref ref-type="fig" rid="fig-32">32</xref>.</p>
<p>In Figure <xref ref-type="fig" rid="fig-81">81</xref>, two types of squashing functions are used: One type (three blue boxes with <inline-formula id="ieqn-1515"><mml:math id="mml-ieqn-1515"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x02110;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>) squashes inputs into the range <inline-formula id="ieqn-1516"><mml:math id="mml-ieqn-1516"><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (e.g., the logistic sigmoid, Eq. (<xref ref-type="disp-formula" rid="eqn-113">113</xref>)), and the other type (purple box with <inline-formula id="ieqn-1517"><mml:math id="mml-ieqn-1517"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi>g</mml:mi></mml:math></inline-formula>, brown box with <inline-formula id="ieqn-1518"><mml:math id="mml-ieqn-1518"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">s</mml:mi></mml:math></inline-formula>) squashes inputs into the range <inline-formula id="ieqn-1519"><mml:math id="mml-ieqn-1519"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (e.g., the hyperbolic tangent, Eq. (<xref ref-type="disp-formula" rid="eqn-114">114</xref>)). 
The <italic>gates</italic> are the activation functions <inline-formula id="ieqn-1520"><mml:math id="mml-ieqn-1520"><mml:msub><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub></mml:math></inline-formula> with <inline-formula id="ieqn-1521"><mml:math id="mml-ieqn-1521"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x02110;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mi>g</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> (3 blue and 1 purple boxes), with argument containing the input <inline-formula id="ieqn-1522"><mml:math id="mml-ieqn-1522"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> (through <inline-formula id="ieqn-1523"><mml:math id="mml-ieqn-1523"><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">z</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:math></inline-formula>).</p> 
<statement id="st7_4"><title>Remark 7.4.</title>
<p>The activation function <inline-formula id="ieqn-1524"><mml:math id="mml-ieqn-1524"><mml:msub><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi></mml:msub></mml:math></inline-formula> (brown box in Figure <xref ref-type="fig" rid="fig-81">81</xref>) is a hyperbolic tangent, but is not called a gate, since it has the cell state <inline-formula id="ieqn-1525"><mml:math id="mml-ieqn-1525"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:math></inline-formula>, but not the input <inline-formula id="ieqn-1526"><mml:math id="mml-ieqn-1526"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, as argument. In other words, a gate has to take in the input <inline-formula id="ieqn-1527"><mml:math id="mml-ieqn-1527"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in its argument.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>There are two feedback loops, each with a delay by one step. The cell-state feedback loop (red) at the top involves the LSTM cell state <inline-formula id="ieqn-1528"><mml:math id="mml-ieqn-1528"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, with a delay by one step. Since the cell state <inline-formula id="ieqn-1529"><mml:math id="mml-ieqn-1529"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is <italic>not</italic> squashed by a sigmoidal activation function, vanishing or exploding gradient is avoided; the &#x201C;short-term&#x201D; memory would last longer, thus the name &#x201C;Long Short-Term Memory&#x201D;. See Remark <xref ref-type="statement" rid="st5_5">5.5</xref> on vanishing or exploding gradient in back-propagation, and Remark <xref ref-type="statement" rid="st7_3">7.3</xref> on short-term memory.</p>
<p>In the hidden-state feedback loop (green) at the bottom, the combination <inline-formula id="ieqn-1530"><mml:math id="mml-ieqn-1530"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> of the input <inline-formula id="ieqn-1531"><mml:math id="mml-ieqn-1531"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the previous hidden state <inline-formula id="ieqn-1532"><mml:math id="mml-ieqn-1532"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is squashed into the range <inline-formula id="ieqn-1533"><mml:math id="mml-ieqn-1533"><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> by the processor <inline-formula id="ieqn-1534"><mml:math id="mml-ieqn-1534"><mml:msub><mml:mi>&#x2131;</mml:mi><mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> to form a factor that filters out less important information from the cell state <inline-formula id="ieqn-1535"><mml:math id="mml-ieqn-1535"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo 
stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, which had been squashed into the range <inline-formula id="ieqn-1536"><mml:math id="mml-ieqn-1536"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. See Figure <xref ref-type="fig" rid="fig-82">82</xref> for an unfolded RNN with LSTM cells. See also Appendix <xref ref-type="sec" rid="s16">2</xref> and Figure <xref ref-type="fig" rid="fig-152">152</xref> for an alternative block diagram.</p>
<fig id="fig-82">
<label>Figure 82</label>
<caption><title><italic>Unfolded RNN with LSTM cells</italic> (Sections <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, <xref ref-type="sec" rid="s7_2">7.2</xref>, <xref ref-type="sec" rid="s12_1">12.1</xref>): In this unfolded RNN, the cell states are centered at the LSTM cell <inline-formula id="ieqn-741"><mml:math id="mml-ieqn-741"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, preceded by the LSTM cell <inline-formula id="ieqn-742"><mml:math id="mml-ieqn-742"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, and followed by the LSTM cell <inline-formula id="ieqn-743"><mml:math id="mml-ieqn-743"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. See Eq. (<xref ref-type="disp-formula" rid="eqn-290">290</xref>) for the recurring relation among the successive cell states, and Figure <xref ref-type="fig" rid="fig-81">81</xref> for a <italic>folded</italic> RNN with details of an LSTM cell. Unlike conventional RNNs, in which the hidden state is repeatedly multiplied with its shared weights, the additional cell state of an LSTM network can propagate over several time steps without much interference. For this reason, LSTM networks typically perform significantly better on long sequences as compared to conventional RNNs, which suffer from the problem of vanishing gradients when being trained. See also Figure <xref ref-type="fig" rid="fig-117">117</xref> in Section <xref ref-type="sec" rid="s12_2">12.2</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-82.tif"/>
</fig>
<p>As the term suggests, the presence of gates that control the information flow is a further key concept in gated RNNs, and in LSTM in particular. Gates are constructed from a linear layer with a sigmoidal function (logistic sigmoid or hyperbolic tangent) as activation function that squashes the components of a vector into the range <inline-formula id="ieqn-1537"><mml:math id="mml-ieqn-1537"><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (logistic sigmoid) or <inline-formula id="ieqn-1538"><mml:math id="mml-ieqn-1538"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (hyperbolic tangent). A component-wise multiplication of the sigmoid&#x2019;s output with the cell state controls the evolution of the cell state (forward pass) and the flow of its gradients (backward pass), respectively. A gate value of 0 suppresses the propagation of a component of the cell state, whereas a gate value of 1 allows the component to pass unchanged.</p>
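<p>The gating by component-wise multiplication can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names <italic>logistic</italic> and <italic>apply_gate</italic>, the three-component state, and the pre-activation values are hypothetical and chosen so that the gate is nearly open, nearly closed, and half open, respectively.</p>
<preformat>
```python
import math


def logistic(z):
    """Logistic sigmoid, squashing z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def apply_gate(pre_activations, state):
    """Scale each state component by its logistic gate value."""
    gate = [logistic(z) for z in pre_activations]
    return [g * s for g, s in zip(gate, state)]


# Hypothetical values: a large positive pre-activation opens the gate,
# a large negative one closes it, and zero gives a gate value of 0.5.
state = [2.0, -1.0, 0.5]
pre_activations = [20.0, -20.0, 0.0]

gated = apply_gate(pre_activations, state)
print(gated)
```
</preformat>
<p>Because the logistic sigmoid only approaches 0 and 1 asymptotically, the first component passes almost unchanged and the second is almost completely suppressed, rather than exactly so.</p>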
<p>To understand the function of LSTM, we follow the paths along which information is routed through an LSTM cell. At time <inline-formula id="ieqn-1539"><mml:math id="mml-ieqn-1539"><mml:mi>n</mml:mi></mml:math></inline-formula>, we assume that the hidden state <inline-formula id="ieqn-1540"><mml:math id="mml-ieqn-1540"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the cell state <inline-formula id="ieqn-1541"><mml:math id="mml-ieqn-1541"><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> from the previous time step, along with the current input <inline-formula id="ieqn-1542"><mml:math id="mml-ieqn-1542"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, are given. The hidden state <inline-formula id="ieqn-1543"><mml:math id="mml-ieqn-1543"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the input <inline-formula id="ieqn-1544"><mml:math id="mml-ieqn-1544"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are inputs to the <italic>forget gate</italic>, i.e., a fully connected layer with a sigmoid non-linearity, and the general expression Eq. 
(<xref ref-type="disp-formula" rid="eqn-286">286</xref>), with <inline-formula id="ieqn-1545"><mml:math id="mml-ieqn-1545"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula> (see Figure <xref ref-type="fig" rid="fig-82">82</xref>), becomes (Figure <xref ref-type="fig" rid="fig-81">81</xref>, where the superscript <inline-formula id="ieqn-1546"><mml:math id="mml-ieqn-1546"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> on <inline-formula id="ieqn-1547"><mml:math id="mml-ieqn-1547"><mml:msub><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub></mml:math></inline-formula> was omitted to alleviate the notation)</p>
<p><disp-formula id="eqn-287"><label>(287)</label><mml:math id="mml-eqn-287" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:msup><mml:mi 
mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The weights associated with the hidden state and the input are <inline-formula id="ieqn-1548"><mml:math id="mml-ieqn-1548"><mml:msub><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1549"><mml:math id="mml-ieqn-1549"><mml:msub><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula>, respectively; the bias vector of the forget gate is denoted by <inline-formula id="ieqn-1550"><mml:math id="mml-ieqn-1550"><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula>. The forget gate determines which components of the previous cell state <inline-formula id="ieqn-1551"><mml:math id="mml-ieqn-1551"><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are to be kept, and to what extent. Knowing which information of the cell state to keep, the next step is to determine how to update the cell state. For this purpose, the hidden state <inline-formula id="ieqn-1552"><mml:math id="mml-ieqn-1552"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the input <inline-formula id="ieqn-1553"><mml:math id="mml-ieqn-1553"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are fed to a linear layer with a hyperbolic tangent as activation function, called the <italic>external input gate</italic>, and the general expression Eq. 
(<xref ref-type="disp-formula" rid="eqn-286">286</xref>), with <inline-formula id="ieqn-1554"><mml:math id="mml-ieqn-1554"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, becomes (Figure <xref ref-type="fig" rid="fig-81">81</xref>)</p>
<p><disp-formula id="eqn-288"><label>(288)</label><mml:math id="mml-eqn-288" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi>g</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo 
stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Again, <inline-formula id="ieqn-1555"><mml:math id="mml-ieqn-1555"><mml:msub><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1556"><mml:math id="mml-ieqn-1556"><mml:msub><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula> are linear weights and <inline-formula id="ieqn-1557"><mml:math id="mml-ieqn-1557"><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula> represents the bias. The output vector of the tanh layer, which is also referred to as <italic>cell gate</italic>, can be regarded as a candidate for updates to the cell state.</p>
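<p>The forget gate of Eq. (<xref ref-type="disp-formula" rid="eqn-287">287</xref>) and the candidate values of Eq. (<xref ref-type="disp-formula" rid="eqn-288">288</xref>) share the same affine form and differ only in their parameters and squashing function. The following plain-Python sketch makes this explicit. It is not an implementation from the cited works: the function names, the sizes (two hidden units, three input features), and all numerical values are hypothetical, and the same weights are reused for both evaluations purely to keep the example short, whereas each gate has its own parameters in an actual LSTM cell.</p>
<preformat>
```python
import math


def logistic(z):
    """Logistic sigmoid with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, h_prev, U, x, b):
    """Return z = W h_prev + U x + b for matrices given as nested lists."""
    z = []
    for W_row, U_row, b_i in zip(W, U, b):
        z_i = sum(w * h for w, h in zip(W_row, h_prev))
        z_i += sum(u * x_j for u, x_j in zip(U_row, x))
        z.append(z_i + b_i)
    return z


def forget_gate(W_f, U_f, b_f, h_prev, x):
    """Forget gate: logistic sigmoid of the affine map, as in Eq. (287)."""
    return [logistic(z) for z in affine(W_f, h_prev, U_f, x, b_f)]


def candidate(W_g, U_g, b_g, h_prev, x):
    """Candidate values: tanh of the affine map, as in Eq. (288)."""
    return [math.tanh(z) for z in affine(W_g, h_prev, U_g, x, b_g)]


# Hypothetical sizes and values, chosen only for illustration.
h_prev = [0.1, -0.2]                    # previous hidden state
x = [1.0, 0.0, -1.0]                    # current input
W = [[0.5, 0.0], [0.0, 0.5]]            # weights acting on h_prev
U = [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]  # weights acting on x
b = [0.0, 1.0]                          # bias

f = forget_gate(W, U, b, h_prev, x)     # components in (0, 1)
g = candidate(W, U, b, h_prev, x)       # components in (-1, +1)
print(f, g)
```
</preformat>
<p>Writing the affine map once and passing it through different squashing functions mirrors the generic relation of Eq. (<xref ref-type="disp-formula" rid="eqn-286">286</xref>), in which only the subscript of the parameters and of the activation function changes from gate to gate.</p>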
<p>The actual updates are determined by a component-wise multiplication of the candidate values <inline-formula id="ieqn-1559"><mml:math id="mml-ieqn-1559"><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> with the <italic>input gate</italic>, which has the same structure as the forget gate but has its own parameters (i.e., <inline-formula id="ieqn-1560"><mml:math id="mml-ieqn-1560"><mml:msub><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-1561"><mml:math id="mml-ieqn-1561"><mml:msub><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-1562"><mml:math id="mml-ieqn-1562"><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>), and the general expression Eq. (<xref ref-type="disp-formula" rid="eqn-286">286</xref>) becomes (Figure <xref ref-type="fig" rid="fig-81">81</xref>)</p>
<p><disp-formula id="eqn-289"><label>(289)</label><mml:math id="mml-eqn-289" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mi 
mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The new cell state <inline-formula id="ieqn-1563"><mml:math id="mml-ieqn-1563"><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is the sum of the previous cell state, scaled by the forget gate, and the candidate values, scaled by the input gate,</p>
<p><disp-formula id="eqn-290"><label>(290)</label><mml:math id="mml-eqn-290" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2299;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2299;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where &#x201C;<inline-formula id="ieqn-1564"><mml:math id="mml-ieqn-1564"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula>&#x201D; denotes the component-wise multiplication of matrices, also known as the Hadamard product.</p>
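<p>As a minimal numerical sketch of Eq. (<xref ref-type="disp-formula" rid="eqn-290">290</xref>), the following Python/NumPy snippet forms the new cell state from given gate values. The function name <monospace>update_cell_state</monospace> and all numerical values are hypothetical and chosen only to illustrate how the two gates act component by component.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
import numpy as np

def update_cell_state(f, i, g, c_prev):
    """Eq. (290): c[n] = f[n] * c[n-1] + i[n] * g[n], products taken component-wise."""
    return f * c_prev + i * g   # NumPy's * on arrays is the Hadamard product

# Hypothetical values, one entry per cell-state component.
f = np.array([1.0, 0.0, 0.5])         # forget gate: keep, discard, halve
i = np.array([0.0, 1.0, 0.5])         # input gate: block, admit, halve
g = np.array([0.9, -0.4, 0.2])        # candidate values in (-1, 1)
c_prev = np.array([2.0, 3.0, -1.0])   # previous cell state

c_new = update_cell_state(f, i, g, c_prev)
print(c_new)
```
</preformat>
<p>In this sketch, the first component keeps the old state unchanged, the second is overwritten by its candidate value, and the third is an equal mixture of the two contributions.</p>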
<statement id="st7_5"><title>Remark 7.5.</title>
<p>The path of the cell state <inline-formula id="ieqn-1565"><mml:math id="mml-ieqn-1565"><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is reminiscent of the identity map that jumps over some layers to create a residual map inside the jump in the building block of a residual network; see Figure <xref ref-type="fig" rid="fig-44">44</xref> and also Remark <xref ref-type="statement" rid="st4_7">4.7</xref> on the identity map in residual networks. &#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>Finally, the hidden state of the LSTM cell is computed from the cell state <inline-formula id="ieqn-1566"><mml:math id="mml-ieqn-1566"><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. The cell state is first squashed into the range <inline-formula id="ieqn-1567"><mml:math id="mml-ieqn-1567"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> by a hyperbolic tangent tanh, i.e., the general expression Eq. (<xref ref-type="disp-formula" rid="eqn-286">286</xref>) becomes (Figure <xref ref-type="fig" rid="fig-81">81</xref>)</p>
<p><disp-formula id="eqn-291"><label>(291)</label><mml:math id="mml-eqn-291" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo 
stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2261;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>before the result <inline-formula id="ieqn-1569"><mml:math id="mml-ieqn-1569"><mml:msubsup><mml:mi>&#x2131;</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> is multiplied with the output of a third sigmoid gate, i.e., the <italic>output gate</italic>, for which the general expression Eq. (<xref ref-type="disp-formula" rid="eqn-286">286</xref>) becomes (Figure <xref ref-type="fig" rid="fig-81">81</xref>)</p>
<p><disp-formula id="eqn-292"><label>(292)</label><mml:math id="mml-eqn-292" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi 
mathvariant="bold-italic">W</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Hence, the output, i.e., the new hidden state <inline-formula id="ieqn-1570"><mml:math id="mml-ieqn-1570"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, is given by</p>
<p><disp-formula id="eqn-293"><label>(293)</label><mml:math id="mml-eqn-293" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2299;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2299;</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The LSTM cell offers some intuition for the respective choices of activation function. The hyperbolic tangent is used to normalize and center information that is to be incorporated into the cell state or the hidden state. The forget gate, input gate, and output gate use the sigmoid function, which takes values between 0 and 1, to either discard information or let it pass.</p>
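<p>The relations above can be collected into a single step, sketched below in Python/NumPy under stated assumptions. The input gate, cell state, output gate, and hidden state follow Eqs. (<xref ref-type="disp-formula" rid="eqn-289">289</xref>), (<xref ref-type="disp-formula" rid="eqn-290">290</xref>), (<xref ref-type="disp-formula" rid="eqn-292">292</xref>), and (<xref ref-type="disp-formula" rid="eqn-293">293</xref>). The forget gate is assumed to have the same sigmoid-of-affine-map form as the input gate, and the candidate values are assumed to be the hyperbolic tangent of an affine map with its own parameters. The names <monospace>lstm_step</monospace> and <monospace>params</monospace>, the sizes, and the random parameter values are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'g', 'o' to a triple (W, U, b)."""
    def affine(key):
        W, U, b = params[key]
        return W @ h_prev + U @ x + b   # z = W h[n-1] + U x[n] + b

    f = sigmoid(affine("f"))   # forget gate (assumed: same form as the input gate)
    i = sigmoid(affine("i"))   # input gate, Eq. (289)
    g = np.tanh(affine("g"))   # candidate values (assumed: tanh of an affine map)
    o = sigmoid(affine("o"))   # output gate, Eq. (292)
    c = f * c_prev + i * g     # new cell state, Eq. (290)
    h = o * np.tanh(c)         # new hidden state, Eqs. (291) and (293)
    return h, c

# Hypothetical sizes and small random parameters, for illustration only.
rng = np.random.default_rng(0)
n_hidden, n_input, n_steps = 4, 3, 5
params = {
    key: (0.1 * rng.standard_normal((n_hidden, n_hidden)),
          0.1 * rng.standard_normal((n_hidden, n_input)),
          np.zeros(n_hidden))
    for key in ("f", "i", "g", "o")
}

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x in rng.standard_normal((n_steps, n_input)):
    h, c = lstm_step(x, h, c, params)

print("hidden state:", h)
print("cell state:  ", c)
```
</preformat>
<p>Because the output gate takes values in (0, 1) and the hyperbolic tangent takes values in (&#x2212;1, +1), every component of the hidden state in this sketch lies in (&#x2212;1, +1), whereas the cell state itself is not bounded in this way.</p>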
<fig id="fig-83">
<label>Figure 83</label>
<caption><title><italic>Folded RNN with Gated Recurrent Unit (GRU)</italic> (Section <xref ref-type="sec" rid="s7_3">7.3</xref>). The cell state at <inline-formula id="ieqn-744"><mml:math id="mml-ieqn-744"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-745"><mml:math id="mml-ieqn-745"><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> are inputs to produce the hidden state <inline-formula id="ieqn-746"><mml:math id="mml-ieqn-746"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. One feedback loop for the hidden state <inline-formula id="ieqn-747"><mml:math id="mml-ieqn-747"><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:math></inline-formula>, with one-step delay <inline-formula id="ieqn-748"><mml:math id="mml-ieqn-748"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. 
The key unified recurring relation is <inline-formula id="ieqn-749"><mml:math id="mml-ieqn-749"><mml:msub><mml:mi>&#x2131;</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi mathvariant="bold-italic">&#x03B1;</mml:mi><mml:mrow><mml:mo mathvariant="bold" stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">k</mml:mi><mml:mo mathvariant="bold">&#x2212;</mml:mo><mml:mn mathvariant="bold">1</mml:mn><mml:mo mathvariant="bold" stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-750"><mml:math id="mml-ieqn-750"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>r</mml:mi><mml:mrow><mml:mtext>&#xA0;(reset)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mrow><mml:mtext>&#xA0;(update)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mtext>&#xA0;(Output)</mml:mtext></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-751"><mml:math id="mml-ieqn-751"><mml:msub><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub></mml:math></inline-formula> is a logistic sigmoid activation function, and <inline-formula id="ieqn-752"><mml:math id="mml-ieqn-752"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> is a linear combination of some inputs with weights plus biases at cell state <inline-formula 
id="ieqn-753"><mml:math id="mml-ieqn-753"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. Compare to the LSTM cell in Figure <xref ref-type="fig" rid="fig-81">81</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-83.tif"/>
</fig>
</sec>
<sec id="s7_3"><label>7.3</label>
<title>Gated Recurrent Unit (GRU)</title>
<p>In a unified manner, the various relations in the GRU depicted in Figure <xref ref-type="fig" rid="fig-83">83</xref> can be expressed by a single generic recurring relation, similar to Eq. (<xref ref-type="disp-formula" rid="eqn-286">286</xref>) for the LSTM:</p>
<p><disp-formula id="eqn-294"><label>(294)</label><mml:math id="mml-eqn-294" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>r</mml:mi><mml:mrow><mml:mtext>&#xA0;(reset)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mrow><mml:mtext>&#xA0;(update)</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mtext>&#xA0;(Output)</mml:mtext></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1571"><mml:math id="mml-ieqn-1571"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> is the input at cell state <inline-formula id="ieqn-1572"><mml:math id="mml-ieqn-1572"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-1573"><mml:math id="mml-ieqn-1573"><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> the hidden variable at cell state <inline-formula id="ieqn-1574"><mml:math id="mml-ieqn-1574"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-1575"><mml:math id="mml-ieqn-1575"><mml:msub><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub></mml:math></inline-formula> the logistic sigmoid activation function, and <inline-formula id="ieqn-1576"><mml:math id="mml-ieqn-1576"><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> a linear combination of some inputs 
with weights plus biases at cell state <inline-formula id="ieqn-1577"><mml:math id="mml-ieqn-1577"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>.</p>
<p>To facilitate a direct comparison between the GRU cell and the LSTM cell, the locations of the boxes (gates) in the GRU cell in Figure <xref ref-type="fig" rid="fig-83">83</xref> are identical to those in the LSTM in Figure <xref ref-type="fig" rid="fig-81">81</xref>. It can be observed that in the GRU cell (1) There is no feedback loop for the cell state, (2) The input is <inline-formula id="ieqn-1578"><mml:math id="mml-ieqn-1578"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> (instead of <inline-formula id="ieqn-1579"><mml:math id="mml-ieqn-1579"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>), (3) The <italic>reset gate</italic> replaces the LSTM forget gate, (4) The <italic>update gate</italic> replaces the LSTM input gate, (5) The <italic>identity map</italic> (no effect on <inline-formula id="ieqn-1580"><mml:math id="mml-ieqn-1580"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>) replaces the LSTM external input gate, (6) The complement of the update gate, i.e., <inline-formula id="ieqn-1581"><mml:math id="mml-ieqn-1581"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> replaces the LSTM state activation <inline-formula id="ieqn-1582"><mml:math 
id="mml-ieqn-1582"><mml:msub><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi></mml:math></inline-formula>. There are fewer activations in the GRU cell compared to the LSTM cell.</p>
<p>The GRU was introduced in [<xref ref-type="bibr" rid="ref-220">220</xref>] and tested against LSTM and tanh-RNN in [<xref ref-type="bibr" rid="ref-221">221</xref>], with GRU schematics that are concise and thus not easy to follow, unlike Figure <xref ref-type="fig" rid="fig-83">83</xref>. The GRU relations below follow [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 400.</p>
<p>The hidden variable <inline-formula id="ieqn-1584"><mml:math id="mml-ieqn-1584"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-1585"><mml:math id="mml-ieqn-1585"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, is obtained from the GRU cell state at <inline-formula id="ieqn-1586"><mml:math id="mml-ieqn-1586"><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, including the input <inline-formula id="ieqn-1587"><mml:math id="mml-ieqn-1587"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, and is a convex combination of <inline-formula id="ieqn-1588"><mml:math id="mml-ieqn-1588"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the GRU <italic>output-gate</italic> effect <inline-formula id="ieqn-1589"><mml:math id="mml-ieqn-1589"><mml:msubsup><mml:mi>&#x2131;</mml:mi><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, using the GRU <italic>update-gate</italic> effect <inline-formula id="ieqn-1590"><mml:math id="mml-ieqn-1590"><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi 
mathvariant="bold-italic">u</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> as coefficient</p>
<p><disp-formula id="eqn-295"><label>(295)</label><mml:math id="mml-eqn-295" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2299;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2299;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>For the GRU <italic>update-gate</italic> effect <inline-formula id="ieqn-1591"><mml:math id="mml-ieqn-1591"><mml:msubsup><mml:mi>&#x2131;</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, the generic relation Eq. (<xref ref-type="disp-formula" rid="eqn-294">294</xref>) becomes (Figure <xref ref-type="fig" rid="fig-83">83</xref>, where the superscript <inline-formula id="ieqn-1592"><mml:math id="mml-ieqn-1592"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> on <inline-formula id="ieqn-1593"><mml:math id="mml-ieqn-1593"><mml:msub><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:msub></mml:math></inline-formula> is omitted to simplify the notation), with <inline-formula id="ieqn-1594"><mml:math id="mml-ieqn-1594"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-296"><label>(296)</label><mml:math id="mml-eqn-296" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi 
mathvariant="bold-italic">z</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">U</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>For the GRU <italic>output-gate</italic> effect <inline-formula id="ieqn-1595"><mml:math id="mml-ieqn-1595"><mml:msubsup><mml:mi>&#x2131;</mml:mi><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, the generic Eq. (<xref ref-type="disp-formula" rid="eqn-294">294</xref>) becomes (Figure <xref ref-type="fig" rid="fig-83">83</xref>), with <inline-formula id="ieqn-1596"><mml:math id="mml-ieqn-1596"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-297"><label>(297)</label><mml:math id="mml-eqn-297" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x2131;</mml:mi></mml:mrow><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2299;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>For the GRU <italic>reset-gate</italic> effect <inline-formula id="ieqn-1597"><mml:math id="mml-ieqn-1597"><mml:msubsup><mml:mi>&#x2131;</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, the generic Eq. (<xref ref-type="disp-formula" rid="eqn-294">294</xref>) becomes (Figure <xref ref-type="fig" rid="fig-83">83</xref>), with <inline-formula id="ieqn-1598"><mml:math id="mml-ieqn-1598"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi>r</mml:mi></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-298"><label>(298)</label><mml:math id="mml-eqn-298" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>&#x2131;</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">U</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
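<p>The reset-gate and output-gate effects in Eqs. (297) and (298) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular library: the helper names (<monospace>matvec</monospace>, <monospace>add</monospace>, <monospace>gru_gate_effects</monospace>) and the parameter dictionary are hypothetical, plain Python lists stand in for vectors and matrices, and the logistic sigmoid is assumed for both activations because the equations write the same activation symbol for both gates. Many GRU formulations use tanh for the second one, so both activations are left as arguments.</p>

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def sigmoid(z):
    """Logistic sigmoid applied element-wise (assumed activation)."""
    return [1.0 / (1.0 + math.exp(-z_i)) for z_i in z]


def gru_gate_effects(h_prev, x_prev, params,
                     reset_activation=sigmoid, output_activation=sigmoid):
    """Evaluate the reset-gate and output-gate effects for one step.

    params is a hypothetical container mapping the gate names "r" and "o"
    to a tuple (W, U, b) of weight matrices and a bias vector.
    """
    W_r, U_r, b_r = params["r"]
    W_o, U_o, b_o = params["o"]
    # Eq. (298): z_r = W_r h + U_r x + b_r, followed by the activation.
    z_r = add(matvec(W_r, h_prev), matvec(U_r, x_prev), b_r)
    r = reset_activation(z_r)
    # Eq. (297): the reset gate scales h element-wise before W_o is applied.
    gated_h = [r_i * h_i for r_i, h_i in zip(r, h_prev)]
    z_o = add(matvec(W_o, gated_h), matvec(U_o, x_prev), b_o)
    o = output_activation(z_o)
    return r, o


# Tiny example: zero reset-gate parameters give r = 0.5 in every component,
# so an identity W_o sees the previous hidden state scaled by one half.
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z2 = [[0.0, 0.0], [0.0, 0.0]]
example_params = {"r": (Z2, Z2, [0.0, 0.0]), "o": (I2, Z2, [0.0, 0.0])}
r, o = gru_gate_effects([2.0, 4.0], [1.0, -1.0], example_params)
```

<p>The only structural difference between the two gates in this sketch is the element-wise product of the reset gate with the previous hidden state, which appears in the output-gate pre-activation but not in the reset-gate pre-activation.</p>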
<statement id="st7_6"><title>Remark 7.6.</title>
<p>GRU has fewer activation functions compared to LSTM, and is thus likely to be more efficient, even though it was stated in [<xref ref-type="bibr" rid="ref-221">221</xref>] that no concrete conclusion could be made as to &#x201C;which of the two gating units was better.&#x201D; See Remark <xref ref-type="statement" rid="st7_7">7.7</xref> on the use of GRU to solve hyperbolic problems with shock waves.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement> </sec>
<sec id="s7_4"><label>7.4</label>
<title>Sequence modeling, attention mechanisms, Transformer</title>
<p>RNNs and LSTMs are well established for sequence modeling and &#x201C;transduction&#x201D; problems such as language modeling and machine translation. The attention mechanism, introduced in [<xref ref-type="bibr" rid="ref-57">57</xref>] [<xref ref-type="bibr" rid="ref-222">222</xref>], allows for &#x201C;modeling of dependencies without regard to their distance in the input and output sequences,&#x201D; and has been used together with RNNs [<xref ref-type="bibr" rid="ref-31">31</xref>]. The Transformer is a much more efficient architecture that relies solely on an attention mechanism, without the RNN architecture, to &#x201C;draw global dependencies between input and output.&#x201D; Each of these concepts is discussed in detail below.</p>
<sec id="s7_4_1"><label>7.4.1</label>
<title>Sequence modeling, encoder-decoder</title>
<p>The term <italic>neural machine translation</italic> describes the approach of using a single neural network to translate a sentence.<xref ref-type="fn" rid="fn220"><sup>220</sup></xref><fn id="fn220"><label>220</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-57">57</xref>].</p></fn> Machine translation is a special kind of sequence-to-sequence modeling problem, in which a <italic>source sequence</italic> is &#x201C;translated&#x201D; into a <italic>target sequence</italic>. Neural machine translation typically relies on <italic>encoder&#x2013;decoder</italic> architectures.<xref ref-type="fn" rid="fn221"><sup>221</sup></xref><fn id="fn221"><label>221</label><p><italic>Autoencoders</italic> are a special kind of encoder-decoder network, trained to reproduce the input sequence; see Section <xref ref-type="sec" rid="s12_4_3">12.4.3</xref>.</p></fn></p>
<p>The encoder network converts (encodes) the essential information of the input sequence <inline-formula id="ieqn-1599"><mml:math id="mml-ieqn-1599"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4B3;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> into an intermediate, typically fixed-length vector representation, which is also referred to as <italic>context</italic>. The decoder network subsequently generates (decodes) the output sequence <inline-formula id="ieqn-1600"><mml:math id="mml-ieqn-1600"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4B4;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> from the encoded intermediate vector. The intermediate context vector in the encoder&#x2013;decoder structure allows for input and output sequences of different length. 
Consider, for instance, an RNN composed of LSTM cells (Section <xref ref-type="sec" rid="s7_2">7.2</xref>) as the encoder network, which generates the intermediate vector by sequentially processing the elements (words, characters) of the input sequence (sentence). The decoder is a second RNN that accepts the fixed-length intermediate vector generated by the encoder, e.g., as the initial hidden state, and subsequently generates the output sequence (sentences, words) element by element (words, characters).<xref ref-type="fn" rid="fn222"><sup>222</sup></xref><fn id="fn222"><label>222</label><p>For more details on different types of RNNs, see, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 385, Section 10.4 &#x201C;Encoder-decoder sequence-to-sequence architectures.&#x201D;</p></fn></p>
<p>To make things clearer, we briefly sketch the structure of a typical RNN encoder-decoder model following [<xref ref-type="bibr" rid="ref-57">57</xref>]. The encoder <inline-formula id="ieqn-1601"><mml:math id="mml-ieqn-1601"><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula> is composed of a recurrent neural network <inline-formula id="ieqn-1602"><mml:math id="mml-ieqn-1602"><mml:mrow><mml:mtext>&#x1D557;</mml:mtext></mml:mrow></mml:math></inline-formula> and some non-linear feedforward function <inline-formula id="ieqn-1603"><mml:math id="mml-ieqn-1603"><mml:mrow><mml:mtext>&#x1D556;</mml:mtext></mml:mrow></mml:math></inline-formula>. In terms of our notation, let <inline-formula id="ieqn-1604"><mml:math id="mml-ieqn-1604"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> denote the <inline-formula id="ieqn-1605"><mml:math id="mml-ieqn-1605"><mml:mi>k</mml:mi></mml:math></inline-formula>-th element in the sequence of input vectors, <inline-formula id="ieqn-1606"><mml:math id="mml-ieqn-1606"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4B3;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>, and let <inline-formula id="ieqn-1607"><mml:math id="mml-ieqn-1607"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> denote the corresponding hidden state at time <inline-formula id="ieqn-1608"><mml:math id="mml-ieqn-1608"><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. The hidden state at time <inline-formula id="ieqn-1609"><mml:math id="mml-ieqn-1609"><mml:mi>k</mml:mi></mml:math></inline-formula> then follows from the recurrence relation 
(Figure <xref ref-type="fig" rid="fig-79">79</xref>)</p>
<p><disp-formula id="eqn-299"><label>(299)</label><mml:math id="mml-eqn-299" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D557;</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>In Section <xref ref-type="sec" rid="s7_1">7.1</xref>, we referred to <inline-formula id="ieqn-1610"><mml:math id="mml-ieqn-1610"><mml:mrow><mml:mtext>&#x1D557;</mml:mtext></mml:mrow></mml:math></inline-formula> as the transition function (denoted by <inline-formula id="ieqn-1611"><mml:math id="mml-ieqn-1611"><mml:mi>f</mml:mi></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-79">79</xref>). The context vector <inline-formula id="ieqn-1612"><mml:math id="mml-ieqn-1612"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula> is generated by another generally non-linear function <inline-formula id="ieqn-1613"><mml:math id="mml-ieqn-1613"><mml:mrow><mml:mtext>&#x1D556;</mml:mtext></mml:mrow></mml:math></inline-formula>, which takes the sequence of hidden states <inline-formula id="ieqn-1614"><mml:math id="mml-ieqn-1614"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-1615"><mml:math id="mml-ieqn-1615"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula> as input:</p>
<p><disp-formula id="eqn-300"><label>(300)</label><mml:math id="mml-eqn-300" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D556;</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Note that <inline-formula id="ieqn-1616"><mml:math id="mml-ieqn-1616"><mml:mrow><mml:mtext>&#x1D556;</mml:mtext></mml:mrow></mml:math></inline-formula> could as well be a function of just the final hidden state in the encoder-RNN.<xref ref-type="fn" rid="fn223"><sup>223</sup></xref><fn id="fn223"><label>223</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 385.</p></fn></p>
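<p>The encoder recurrence, Eq. (299), and the context function, Eq. (300), can be sketched as follows. This is an illustrative sketch only: the tanh form of the transition function, the weight names <monospace>W</monospace>, <monospace>U</monospace>, <monospace>b</monospace>, and the function names <monospace>rnn_step</monospace> and <monospace>encode</monospace> are assumptions made for the example, and the context is taken to be the final hidden state, which is the simplest choice mentioned above.</p>

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def rnn_step(h_prev, x, W, U, b):
    """Transition function of Eq. (299), assumed here to be
    h[k] = tanh(W h[k-1] + U x[k] + b)."""
    z = [wh + ux + b_i
         for wh, ux, b_i in zip(matvec(W, h_prev), matvec(U, x), b)]
    return [math.tanh(z_i) for z_i in z]


def encode(xs, W, U, b, h0):
    """Process a non-empty input sequence and return all hidden states
    together with the context vector."""
    hidden_states = []
    h = h0
    for x in xs:  # sequential processing, one element at a time
        h = rnn_step(h, x, W, U, b)
        hidden_states.append(h)
    # Eq. (300) with the simplest choice: the context is the last state.
    context = hidden_states[-1]
    return hidden_states, context


# Tiny example: with W = 0, U = identity and b = 0, each hidden state is
# just tanh of the current input, which makes the result easy to check.
Z2 = [[0.0, 0.0], [0.0, 0.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[0.0, 1.0], [1.0, 0.0], [0.5, -0.5]]
states, context = encode(inputs, Z2, I2, [0.0, 0.0], [0.0, 0.0])
```

<p>Whatever the length of the input sequence, the returned context has the dimension of the hidden state, which is what makes the intermediate representation fixed-length.</p>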
<p>From a probabilistic point of view, the joint probability of the entire output sequence (i.e., the &#x201C;translation&#x201D;) <inline-formula id="ieqn-1617"><mml:math id="mml-ieqn-1617"><mml:mrow><mml:mrow><mml:mi mathvariant="bold">y</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> can be decomposed into conditional probabilities of the <inline-formula id="ieqn-1618"><mml:math id="mml-ieqn-1618"><mml:mi>k</mml:mi></mml:math></inline-formula>-th output item <inline-formula id="ieqn-1619"><mml:math id="mml-ieqn-1619"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> given its predecessors <inline-formula id="ieqn-1620"><mml:math id="mml-ieqn-1620"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> and the input sequence <inline-formula id="ieqn-1621"><mml:math id="mml-ieqn-1621"><mml:mrow><mml:mrow><mml:mi mathvariant="bold">x</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula>, which in turn can be approximated using the context vector <inline-formula id="ieqn-1622"><mml:math id="mml-ieqn-1622"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-301"><label>(301)</label><mml:math id="mml-eqn-301" display="block"><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>n</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi mathvariant="bold">x</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle><mml:mo>&#x2248;</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">c</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:mstyle></mml:mrow></mml:math></disp-formula></p>
<p>Accordingly, the decoder is trained to predict the next item (word, character) in the output sequence <inline-formula id="ieqn-1623"><mml:math id="mml-ieqn-1623"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> given the previous items <inline-formula id="ieqn-1624"><mml:math id="mml-ieqn-1624"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> and the context vector <inline-formula id="ieqn-1625"><mml:math id="mml-ieqn-1625"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula>. In analogy to the encoder <inline-formula id="ieqn-1626"><mml:math id="mml-ieqn-1626"><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula>, the decoder <inline-formula id="ieqn-1627"><mml:math id="mml-ieqn-1627"><mml:mi>&#x1D49F;</mml:mi></mml:math></inline-formula> comprises a recurrence function <inline-formula id="ieqn-1628"><mml:math id="mml-ieqn-1628"><mml:mrow><mml:mtext>&#x1D558;</mml:mtext></mml:mrow></mml:math></inline-formula> and a non-linear feedforward function <inline-formula id="ieqn-1629"><mml:math id="mml-ieqn-1629"><mml:mrow><mml:mtext>&#x1D555;</mml:mtext></mml:mrow></mml:math></inline-formula>. Practically, RNNs provide an intuitive means to realize functions of variable-length sequences, since the current hidden state in RNNs contains information of all previous inputs. 
Accordingly, the decoder&#x2019;s hidden state <inline-formula id="ieqn-1630"><mml:math id="mml-ieqn-1630"><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> at step <inline-formula id="ieqn-1631"><mml:math id="mml-ieqn-1631"><mml:mi>k</mml:mi></mml:math></inline-formula> follows from the recurrence relation <inline-formula id="ieqn-1632"><mml:math id="mml-ieqn-1632"><mml:mrow><mml:mtext>&#x1D558;</mml:mtext></mml:mrow></mml:math></inline-formula> as</p>
<p><disp-formula id="eqn-302"><label>(302)</label><mml:math id="mml-eqn-302" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#x1D558;</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>To predict the conditional probability of the next item <inline-formula id="ieqn-1633"><mml:math id="mml-ieqn-1633"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> by means of the function <inline-formula id="ieqn-1634"><mml:math id="mml-ieqn-1634"><mml:mrow><mml:mtext>&#x1D555;</mml:mtext></mml:mrow></mml:math></inline-formula>, the decoder can therefore use only the previous item <inline-formula id="ieqn-1635"><mml:math id="mml-ieqn-1635"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the current hidden state <inline-formula id="ieqn-1636"><mml:math id="mml-ieqn-1636"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">s</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> as inputs (along with the context vector <inline-formula id="ieqn-1637"><mml:math id="mml-ieqn-1637"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula>) rather than all previously predicted items <inline-formula id="ieqn-1638"><mml:math id="mml-ieqn-1638"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-303"><label>(303)</label><mml:math id="mml-eqn-303" display="block"><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi mathvariant="bold-italic">c</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>&#x1D555;</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi mathvariant="bold-italic">c</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>We have various choices of how the context vector <inline-formula id="ieqn-1639"><mml:math id="mml-ieqn-1639"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula> and inputs <inline-formula id="ieqn-1640"><mml:math id="mml-ieqn-1640"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are fed into the decoder-RNN.<xref ref-type="fn" rid="fn224"><sup>224</sup></xref><fn id="fn224"><label>224</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 386.</p></fn> The context vector <inline-formula id="ieqn-1641"><mml:math id="mml-ieqn-1641"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula>, for instance, can either be used as the decoder&#x2019;s initial hidden state <inline-formula id="ieqn-1642"><mml:math id="mml-ieqn-1642"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> or, alternatively, as the first input.</p>
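<p>The decoder relations, Eqs. (302) and (303), together with the factorization in Eq. (301), can be sketched as a greedy decoding loop. The sketch rests on several assumptions that are not taken from the text: output items are integer token indices fed back as one-hot vectors, the recurrence function is a tanh layer, the feedforward function is a softmax layer, the context vector initializes the hidden state and is also passed at every step, and all names (<monospace>decoder_step</monospace>, <monospace>greedy_decode</monospace>, the weight keys) are hypothetical.</p>

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(z_i - m) for z_i in z]
    total = sum(e)
    return [e_i / total for e_i in e]


def decoder_step(s_prev, y_prev, c, p):
    """One decoder step; p holds hypothetical weight matrices."""
    # Eq. (302): s[k] depends on s[k-1], y[k-1] and the context c.
    z = add(matvec(p["Ws"], s_prev), matvec(p["Wy"], y_prev), matvec(p["Wc"], c))
    s = [math.tanh(z_i) for z_i in z]
    # Eq. (303): the distribution of y[k] depends on y[k-1], s[k] and c.
    logits = add(matvec(p["Vs"], s), matvec(p["Vy"], y_prev), matvec(p["Vc"], c))
    return s, softmax(logits)


def greedy_decode(c, p, vocab_size, start_id, steps):
    """Pick the most probable item at each step and accumulate the
    log of the product in Eq. (301) as a sum of logarithms."""
    s = list(c)  # context used as the initial hidden state (one option)
    y_prev = one_hot(start_id, vocab_size)
    tokens, log_prob = [], 0.0
    for _ in range(steps):
        s, probs = decoder_step(s, y_prev, c, p)
        k = probs.index(max(probs))
        tokens.append(k)
        log_prob += math.log(probs[k])
        y_prev = one_hot(k, vocab_size)
    return tokens, log_prob


# Tiny example: only Vy is non-zero and favors the token following the
# previous one (cyclically), so decoding from token 0 yields 1, 2, 0.
weights = {
    "Ws": zeros(2, 2), "Wy": zeros(2, 3), "Wc": zeros(2, 2),
    "Vs": zeros(3, 2), "Vc": zeros(3, 2),
    "Vy": [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
}
tokens, log_prob = greedy_decode([0.1, -0.2], weights, 3, 0, 3)
```

<p>Because the hidden state is carried from step to step, each call to <monospace>decoder_step</monospace> needs only the previous item rather than the full history of predicted items, which mirrors the argument list on the right-hand side of Eq. (303).</p>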
</sec>
<sec id="s7_4_2"><label>7.4.2</label>
<title>Attention</title>
<p>As the authors of [<xref ref-type="bibr" rid="ref-57">57</xref>] emphasized, the encoder <italic>&#x201C;needs to be able to compress all the necessary information of a source sentence into a fixed-length vector&#x201D;</italic>. For this reason, long sentences pose a challenge in neural machine translation, in particular when the sentences to be translated are longer than those the network has seen during training, which was confirmed by the observations in [<xref ref-type="bibr" rid="ref-223">223</xref>]. To cope with long sentences, an encoder&#x2013;decoder architecture, <italic>&#x201C;which learns to align and translate jointly,&#x201D;</italic> was proposed in [<xref ref-type="bibr" rid="ref-57">57</xref>]. Their approach is motivated by the observation that individual items of the target sequence correspond to different parts of the source sequence. To account for the fact that only a subset of the source sequence is relevant when generating a new item of the target sequence, two key ingredients (alignment and translation) were added to the conventional encoder&#x2013;decoder architecture described above in [<xref ref-type="bibr" rid="ref-57">57</xref>], and are presented below.</p>
<p>&#x261B; The <italic>first key ingredient</italic> of their concept of &#x201C;alignment&#x201D; is the idea of using a distinct context vector <inline-formula id="ieqn-1643"><mml:math id="mml-ieqn-1643"><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> for each output item <inline-formula id="ieqn-1644"><mml:math id="mml-ieqn-1644"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> instead of a single context <inline-formula id="ieqn-1645"><mml:math id="mml-ieqn-1645"><mml:mi mathvariant="bold-italic">c</mml:mi></mml:math></inline-formula>. Accordingly, the recurrence relation of the decoder, Eq. (<xref ref-type="disp-formula" rid="eqn-302">302</xref>), is modified to take the context <inline-formula id="ieqn-1646"><mml:math id="mml-ieqn-1646"><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> as an argument,</p>
<p><disp-formula id="eqn-304"><label>(304)</label><mml:math id="mml-eqn-304" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>as is the conditional probability of the output items, Eq. (<xref ref-type="disp-formula" rid="eqn-303">303</xref>),</p>
<p><disp-formula id="eqn-305"><label>(305)</label><mml:math id="mml-eqn-305" display="block"><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;x</mml:mtext></mml:mrow></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2248;</mml:mo><mml:mtext>&#x1D555;</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>i.e., it is conditioned on distinct context vectors <inline-formula id="ieqn-1647"><mml:math id="mml-ieqn-1647"><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> for each output <inline-formula id="ieqn-1648"><mml:math id="mml-ieqn-1648"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-1649"><mml:math id="mml-ieqn-1649"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>.</p>
<p>The <inline-formula id="ieqn-1650"><mml:math id="mml-ieqn-1650"><mml:mi>k</mml:mi></mml:math></inline-formula>-th context vector <inline-formula id="ieqn-1651"><mml:math id="mml-ieqn-1651"><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is meant to capture the information of that part of the source sequence <inline-formula id="ieqn-1652"><mml:math id="mml-ieqn-1652"><mml:mrow><mml:mrow><mml:mi mathvariant="bold">x</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> which is relevant to the <inline-formula id="ieqn-1653"><mml:math id="mml-ieqn-1653"><mml:mi>k</mml:mi></mml:math></inline-formula>-th target item <inline-formula id="ieqn-1654"><mml:math id="mml-ieqn-1654"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. For this purpose, <inline-formula id="ieqn-1655"><mml:math id="mml-ieqn-1655"><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is computed as a weighted sum of all hidden states <inline-formula id="ieqn-1656"><mml:math id="mml-ieqn-1656"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-1657"><mml:math id="mml-ieqn-1657"><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, of the encoder:</p>
<p><disp-formula id="eqn-306"><label>(306)</label><mml:math id="mml-eqn-306" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The <inline-formula id="ieqn-1658"><mml:math id="mml-ieqn-1658"><mml:mi>k</mml:mi></mml:math></inline-formula>-th hidden state of a conventional RNN obeying the recurrence given by Eq. (<xref ref-type="disp-formula" rid="eqn-299">299</xref>) only includes information about the items processed so far (<inline-formula id="ieqn-1659"><mml:math id="mml-ieqn-1659"><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>) in the source sequence, since the subsequent items (<inline-formula id="ieqn-1660"><mml:math id="mml-ieqn-1660"><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>) have yet to be processed. When generating the <inline-formula id="ieqn-1661"><mml:math id="mml-ieqn-1661"><mml:mi>k</mml:mi></mml:math></inline-formula>-th output item, however, we want information about all source items, before <italic>and</italic> after, to be contained in the <inline-formula id="ieqn-1662"><mml:math id="mml-ieqn-1662"><mml:mi>k</mml:mi></mml:math></inline-formula>-th hidden state <inline-formula id="ieqn-1663"><mml:math id="mml-ieqn-1663"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
<p>For this reason, using a <italic>bidirectional RNN</italic><xref ref-type="fn" rid="fn225"><sup>225</sup></xref><fn id="fn225"><label>225</label><p>See [<xref ref-type="bibr" rid="ref-224">224</xref>].</p></fn> as the encoder was proposed in [<xref ref-type="bibr" rid="ref-57">57</xref>]. A bidirectional RNN combines two RNNs, i.e., a <italic>forward RNN</italic> and a <italic>backward RNN</italic>, which independently process the source sequence in the original and in reverse order, respectively. The two RNNs generate corresponding sequences of forward and backward hidden states,</p>
<p><disp-formula id="eqn-307"><label>(307)</label><mml:math id="mml-eqn-307" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">f</mml:mi><mml:mi mathvariant="normal">w</mml:mi><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mtext>&#x1D557;</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">f</mml:mi><mml:mi mathvariant="normal">w</mml:mi><mml:mi mathvariant="normal">d</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">f</mml:mi><mml:mi mathvariant="normal">w</mml:mi><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi 
mathvariant="normal">v</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mtext>&#x1D557;</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">v</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">v</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>In each step, these vectors are concatenated to a single hidden state vector <inline-formula id="ieqn-1664"><mml:math id="mml-ieqn-1664"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, which the authors of [<xref ref-type="bibr" rid="ref-57">57</xref>] refer to as &#x201C;<italic>annotation</italic>&#x201D; of the <inline-formula id="ieqn-1665"><mml:math id="mml-ieqn-1665"><mml:mi>k</mml:mi></mml:math></inline-formula>-th source item:</p>
<p><disp-formula id="eqn-308"><label>(308)</label><mml:math id="mml-eqn-308" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">f</mml:mi><mml:mi mathvariant="normal">w</mml:mi><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">v</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>They mentioned <italic>&#x201C;the tendency of RNNs to better represent recent inputs&#x201D;</italic> as the reason why the annotation <inline-formula id="ieqn-1666"><mml:math id="mml-ieqn-1666"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> focuses on the items around the <inline-formula id="ieqn-1667"><mml:math id="mml-ieqn-1667"><mml:mi>k</mml:mi></mml:math></inline-formula>-th encoder input <inline-formula id="ieqn-1668"><mml:math id="mml-ieqn-1668"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
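<p>The construction of the annotations in Eqs. (<xref ref-type="disp-formula" rid="eqn-307">307</xref>) and (<xref ref-type="disp-formula" rid="eqn-308">308</xref>) can be sketched in a few lines of Python. The sketch is purely illustrative and rests on assumptions of ours: a plain tanh recurrence stands in for the cell functions of the forward and backward RNNs, the weights are random, the names <monospace>rnn_step</monospace> and <monospace>bidirectional_annotations</monospace> are hypothetical, and the backward state is stored at the position of the source item just read.</p>
<preformat>
```python
import numpy as np


def rnn_step(h_prev, x, W_h, W_x):
    """One step of a plain tanh RNN cell (an assumed, illustrative choice)."""
    return np.tanh(W_h @ h_prev + W_x @ x)


def bidirectional_annotations(xs, params_fwd, params_rev, d_h):
    """Return annotations [h_fwd[k]; h_rev[k]] for a source sequence.

    xs has shape (n, d_x); the result has shape (n, 2 * d_h).
    """
    n = xs.shape[0]
    h_fwd = np.zeros((n, d_h))
    h_rev = np.zeros((n, d_h))

    # Forward RNN: reads the source items in their original order.
    h = np.zeros(d_h)
    for k in range(n):
        h = rnn_step(h, xs[k], *params_fwd)
        h_fwd[k] = h

    # Backward RNN: reads the source items in reverse order. Its state is
    # stored at the position of the item just read (our indexing assumption).
    h = np.zeros(d_h)
    for k in reversed(range(n)):
        h = rnn_step(h, xs[k], *params_rev)
        h_rev[k] = h

    # Concatenate forward and backward states per position, cf. Eq. (308).
    return np.concatenate([h_fwd, h_rev], axis=1)


rng = np.random.default_rng(0)
n, d_x, d_h = 5, 3, 4
xs = rng.normal(size=(n, d_x))
params_fwd = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x)))
params_rev = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x)))
annotations = bidirectional_annotations(xs, params_fwd, params_rev, d_h)
print(annotations.shape)  # (5, 8)
```
</preformat>
<p>With this indexing, the first half of each annotation depends only on the source items up to that position, whereas the second half depends only on the items from that position onward.</p>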
<p>&#x261B; As the <italic>second key ingredient</italic>, the authors of [<xref ref-type="bibr" rid="ref-57">57</xref>] proposed a so-called <italic>&#x201C;alignment model&#x201D;</italic>, i.e., a function <inline-formula id="ieqn-1669"><mml:math id="mml-ieqn-1669"><mml:mi mathvariant="fraktur">a</mml:mi></mml:math></inline-formula> to compute weights <inline-formula id="ieqn-1670"><mml:math id="mml-ieqn-1670"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> needed for the context <inline-formula id="ieqn-1671"><mml:math id="mml-ieqn-1671"><mml:msub><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, Eq. (<xref ref-type="disp-formula" rid="eqn-306">306</xref>),</p>
<p><disp-formula id="eqn-309"><label>(309)</label><mml:math id="mml-eqn-309" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">a</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is meant to quantify (&#x201C;score&#x201D;) the relation, i.e., the <italic>alignment</italic>, between the <inline-formula id="ieqn-1672"><mml:math id="mml-ieqn-1672"><mml:mi>k</mml:mi></mml:math></inline-formula>-th decoder output (target) and inputs <italic>&#x201C;around&#x201D;</italic> the <inline-formula id="ieqn-1673"><mml:math id="mml-ieqn-1673"><mml:mi>l</mml:mi></mml:math></inline-formula>-th position of the source sequence. The score is computed from the decoder&#x2019;s hidden state of the previous output <inline-formula id="ieqn-1674"><mml:math id="mml-ieqn-1674"><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the annotation <inline-formula id="ieqn-1675"><mml:math id="mml-ieqn-1675"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of the <inline-formula id="ieqn-1676"><mml:math id="mml-ieqn-1676"><mml:mi>l</mml:mi></mml:math></inline-formula>-th input item, so it <italic>&#x201C;reflects the importance of the annotation <inline-formula id="ieqn-1677"><mml:math id="mml-ieqn-1677"><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> with respect to the previous hidden state <inline-formula id="ieqn-1678"><mml:math id="mml-ieqn-1678"><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> in deciding the next state <inline-formula id="ieqn-1679"><mml:math 
id="mml-ieqn-1679"><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and generating <inline-formula id="ieqn-1680"><mml:math id="mml-ieqn-1680"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>.&#x201D;</italic></p>
<p>The alignment model is represented by a feedforward neural network, which is jointly trained along with all other components of the encoder&#x2013;decoder architecture. The weights of the annotations, in turn, follow from the alignment scores upon exponentiation and normalization (through <inline-formula id="ieqn-1681"><mml:math id="mml-ieqn-1681"><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (Section <xref ref-type="sec" rid="s5_1_3">5.1.3</xref>) along the second dimension):</p>
<p><disp-formula id="eqn-310"><label>(310)</label><mml:math id="mml-eqn-310" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The weighting in Eq. (<xref ref-type="disp-formula" rid="eqn-306">306</xref>) is interpreted as a way <italic>&#x201C;to compute an expected annotation, where the expectation is over possible alignments&#x201D;</italic> [<xref ref-type="bibr" rid="ref-57">57</xref>]. From this perspective, <inline-formula id="ieqn-1682"><mml:math id="mml-ieqn-1682"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the probability of an output (target) item <inline-formula id="ieqn-1683"><mml:math id="mml-ieqn-1683"><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> being aligned to an input (source) item <inline-formula id="ieqn-1684"><mml:math id="mml-ieqn-1684"><mml:msup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
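<p>To make the chain from alignment scores, Eq. (<xref ref-type="disp-formula" rid="eqn-309">309</xref>), via normalized weights, Eq. (<xref ref-type="disp-formula" rid="eqn-310">310</xref>), to the context vector, Eq. (<xref ref-type="disp-formula" rid="eqn-306">306</xref>), concrete, a minimal Python sketch is given next. It assumes that the alignment model is a feedforward network with a single tanh hidden layer; the parameter names <monospace>W_s</monospace>, <monospace>W_h</monospace> and <monospace>v</monospace>, the dimensions, and the random values are hypothetical and serve illustration only.</p>
<preformat>
```python
import numpy as np


def alignment_scores(s_prev, H, W_s, W_h, v):
    """Scores a_kl for all l, cf. Eq. (309).

    Assumed form: a_kl = v . tanh(W_s s[k-1] + W_h h[l]).
    s_prev: (d_s,), H: (n, d_h), W_s: (d_a, d_s), W_h: (d_a, d_h), v: (d_a,).
    """
    return np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v


def softmax(a):
    """Exponentiation and normalization of the scores, cf. Eq. (310)."""
    e = np.exp(a - np.max(a))  # subtracting the maximum avoids overflow
    return e / e.sum()


def context_vector(s_prev, H, W_s, W_h, v):
    """Context c_k as the weighted sum of the annotations, cf. Eq. (306)."""
    alpha = softmax(alignment_scores(s_prev, H, W_s, W_h, v))
    return alpha @ H, alpha


rng = np.random.default_rng(1)
n, d_h, d_s, d_a = 6, 8, 5, 7       # illustrative sizes
H = rng.normal(size=(n, d_h))       # annotations h[1], ..., h[n]
s_prev = rng.normal(size=d_s)       # previous decoder state s[k-1]
W_s = rng.normal(size=(d_a, d_s))
W_h = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)

c_k, alpha = context_vector(s_prev, H, W_s, W_h, v)
print(c_k.shape)  # (8,)
```
</preformat>
<p>Since the weights are non-negative and sum to one, the context vector is a convex combination of the annotations, which matches the interpretation as an expected annotation.</p>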
<p>In neural machine translation, the attention model in [<xref ref-type="bibr" rid="ref-57">57</xref>] was shown to significantly outperform conventional encoder&#x2013;decoder architectures, which encode the entire source sequence into a single fixed-length vector. In particular, the approach performed better in translating long sentences, where it achieved performance on par with the phrase-based statistical machine translation approaches of that time.</p>
</sec>
<sec id="s7_4_3"><label>7.4.3</label>
<title>Transformer architecture</title>
<p>Despite improvements, the attention model in [<xref ref-type="bibr" rid="ref-57">57</xref>] shared the fundamental drawback intrinsic to all RNN-based models: The sequential nature of RNNs is adverse to parallel computing, making training less efficient than with, e.g., feedforward or convolutional neural networks, which lend themselves to massive parallelization. To overcome this drawback, a novel model architecture, which entirely dispenses with recurrence, was proposed in [<xref ref-type="bibr" rid="ref-31">31</xref>]. As the title <italic>&#x201C;Attention Is All You Need&#x201D;</italic> already reveals, their approach to neural machine translation is exclusively based on the concept of attention (and some feedforward layers), which is repeatedly used in the proposed architecture referred to as <italic>&#x201C;Transformer&#x201D;</italic>.</p>
<p>In what follows, we describe the individual components of the Transformer architecture, see Figure <xref ref-type="fig" rid="fig-85">85</xref>. Among those, the <italic>scaled dot-product attention</italic>, see Figure <xref ref-type="fig" rid="fig-84">84</xref>, is a fundamental building block. Scaled dot-product attention, which is represented by a function <inline-formula id="ieqn-1685"><mml:math id="mml-ieqn-1685"><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:math></inline-formula>, compares a query vector <inline-formula id="ieqn-1686"><mml:math id="mml-ieqn-1686"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">q</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> with a set of <inline-formula id="ieqn-1687"><mml:math id="mml-ieqn-1687"><mml:mi>m</mml:mi></mml:math></inline-formula> key vectors <inline-formula id="ieqn-1688"><mml:math id="mml-ieqn-1688"><mml:msub><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> to determine the weighting of value vectors <inline-formula id="ieqn-1689"><mml:math id="mml-ieqn-1689"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> corresponding to the keys. As opposed to the additive alignment model used in [<xref ref-type="bibr" rid="ref-57">57</xref>], see Eq. (<xref ref-type="disp-formula" rid="eqn-309">309</xref>), scaled dot-product attention combines query and key vectors in a multiplicative way.</p>
<p>Let <inline-formula id="ieqn-1690"><mml:math id="mml-ieqn-1690"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow><mml:mi>m</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1691"><mml:math id="mml-ieqn-1691"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mi>m</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> denote key and value matrices formed from the individual vectors, where query and key vectors share the dimension <inline-formula id="ieqn-1692"><mml:math id="mml-ieqn-1692"><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> and value vectors are <inline-formula 
id="ieqn-1693"><mml:math id="mml-ieqn-1693"><mml:msub><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:msub></mml:math></inline-formula>-dimensional. The attention model produces a context vector <inline-formula id="ieqn-1694"><mml:math id="mml-ieqn-1694"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> by weighting the values <inline-formula id="ieqn-1695"><mml:math id="mml-ieqn-1695"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-1696"><mml:math id="mml-ieqn-1696"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, according to the multiplicative alignment of the query <inline-formula id="ieqn-1697"><mml:math id="mml-ieqn-1697"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">q</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> with keys <inline-formula id="ieqn-1698"><mml:math id="mml-ieqn-1698"><mml:msub><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-1699"><mml:math id="mml-ieqn-1699"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-311"><label>(311)</label><mml:math id="mml-eqn-311" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">q</mml:mi></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">q</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>/</mml:mo><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">q</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>/</mml:mo><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">q</mml:mi></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Scaling by the square root of the query/key dimension is meant to keep the <inline-formula id="ieqn-1700"><mml:math id="mml-ieqn-1700"><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> function out of regions of small gradients for large <inline-formula id="ieqn-1701"><mml:math id="mml-ieqn-1701"><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, since scalar products grow in magnitude with the dimension of queries and keys. The above attention model can be applied to multiple queries simultaneously. For this purpose, let <inline-formula id="ieqn-1702"><mml:math id="mml-ieqn-1702"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">q</mml:mi></mml:mrow></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">q</mml:mi></mml:mrow></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1703"><mml:math id="mml-ieqn-1703"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi></mml:mrow></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi></mml:mrow></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> denote the matrices of query and context vectors, respectively. The attention model can then be rewritten using matrix multiplication as follows:</p>
<p><disp-formula id="eqn-312"><label>(312)</label><mml:math id="mml-eqn-312" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Note that <inline-formula id="ieqn-1704"><mml:math id="mml-ieqn-1704"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> gives a <inline-formula id="ieqn-1705"><mml:math id="mml-ieqn-1705"><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula> matrix, for which the <inline-formula id="ieqn-1706"><mml:math id="mml-ieqn-1706"><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is computed along the second dimension.</p>
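<p>As an illustration, the matrix form of scaled dot-product attention can be sketched in a few lines of plain Python. This is a minimal sketch, not the implementation of [<xref ref-type="bibr" rid="ref-31">31</xref>]: matrices are stored as lists of rows to avoid external dependencies, the helper names (softmax, matmul, attention) and the toy data are our own assumptions, and the scores are divided by the square root of the key dimension before the softmax is applied along the second dimension.</p>
<preformat>
```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    shift = max(row)
    exps = [math.exp(s - shift) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(Q, K, V):
    """Scaled dot-product attention for k queries, m keys and m values."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # k x m matrix of query/key alignments
    # Scale the scores, then normalize each row (the second dimension).
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)  # k x d_V matrix of context vectors


# Toy data (assumed): k = 2 queries, m = 3 keys/values, d_k = d_V = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = attention(Q, K, V)
```
</preformat>
<p>Each row of the result is a convex combination of the value vectors, since the softmax weights of every query are non-negative and sum to one.</p>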
<p>Based on the concept of scaled dot-product attention, the idea of using multiple attention functions in parallel rather than just a single one was proposed in [<xref ref-type="bibr" rid="ref-31">31</xref>], see Figure <xref ref-type="fig" rid="fig-84">84</xref>. In their concept of <italic>&#x201C;Multi-Head Attention&#x201D;</italic>, each &#x201C;head&#x201D; represents a separate context <inline-formula id="ieqn-1707"><mml:math id="mml-ieqn-1707"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> computed from scaled dot-product attention, Eq. (<xref ref-type="disp-formula" rid="eqn-312">312</xref>), on queries <inline-formula id="ieqn-1708"><mml:math id="mml-ieqn-1708"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula>, keys <inline-formula id="ieqn-1709"><mml:math id="mml-ieqn-1709"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> and values <inline-formula id="ieqn-1710"><mml:math id="mml-ieqn-1710"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula>, respectively. 
The inputs to the individual scaled dot-product attention functions <inline-formula id="ieqn-1711"><mml:math id="mml-ieqn-1711"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-1712"><mml:math id="mml-ieqn-1712"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-1713"><mml:math id="mml-ieqn-1713"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-1714"><mml:math id="mml-ieqn-1714"><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">h</mml:mi></mml:math></inline-formula>, in turn, are head-specific (learned) projections of the queries <inline-formula id="ieqn-1715"><mml:math id="mml-ieqn-1715"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AC;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, keys <inline-formula id="ieqn-1716"><mml:math id="mml-ieqn-1716"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and values <inline-formula id="ieqn-1717"><mml:math 
id="mml-ieqn-1717"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4B1;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Assuming that the individual queries, keys and values, i.e., the rows of <inline-formula id="ieqn-1718"><mml:math id="mml-ieqn-1718"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AC;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-1719"><mml:math id="mml-ieqn-1719"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-1720"><mml:math id="mml-ieqn-1720"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4B1;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>, share the dimension <inline-formula id="ieqn-1721"><mml:math id="mml-ieqn-1721"><mml:mi>d</mml:mi></mml:math></inline-formula>, the projections are represented by matrices <inline-formula id="ieqn-1722"><mml:math id="mml-ieqn-1722"><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-1723"><mml:math id="mml-ieqn-1723"><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1724"><mml:math id="mml-ieqn-1724"><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>:<xref ref-type="fn" rid="fn226"><sup>226</sup></xref><fn id="fn226"><label>226</label><p>Unlike the notation in [<xref ref-type="bibr" rid="ref-31">31</xref>], we use different symbols for the arguments of the scaled dot-product attention function, Eq. (<xref ref-type="disp-formula" rid="eqn-312">312</xref>), and those of the multi-head attention, Eq. (<xref ref-type="disp-formula" rid="eqn-314">314</xref>), to emphasize their distinct dimensions.</p></fn></p>
<fig id="fig-84">
<label>Figure 84</label>
<caption><title><italic>Scaled dot-product attention and multi-head attention</italic> (Section <xref ref-type="sec" rid="s7_4_3">7.4.3</xref>). Scaled dot-product attention (left) is the elementary building block of the Transformer model. It compares query vectors (Q) against a set of key vectors (K) to produce a context vector by weighting the value vectors (V) that correspond to the keys. For this purpose, the <inline-formula id="ieqn-754"><mml:math id="mml-ieqn-754"><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> function is applied to the inner product (MatMul) of the query and key vectors (scaled by a constant). The output of the <inline-formula id="ieqn-755"><mml:math id="mml-ieqn-755"><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents the weights by which the value vectors are scaled when taking the inner product (MatMul), see <xref ref-type="disp-formula" rid="eqn-311">Eq. (311)</xref>. To prevent attention functions of the decoder, which generates the output sequence item by item, from using future items of the output sequence, a masking layer is introduced. The masking sets all scores beyond the current time/position to <inline-formula id="ieqn-756"><mml:math id="mml-ieqn-756"><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula>. Multi-head attention (right) combines several (<inline-formula id="ieqn-757"><mml:math id="mml-ieqn-757"><mml:mi mathvariant="bold-italic">h</mml:mi></mml:math></inline-formula>) scaled dot-product attention functions in parallel, each of which is referred to as a &#x201C;head&#x201D;. For this purpose, queries, keys and values are projected by means of head-specific linear layers (Linear), whose outputs are input to the individual scaled dot-product attention functions, see <xref ref-type="disp-formula" rid="eqn-313">Eq. (313)</xref>. The context vectors produced by each head are concatenated before being fed into one more linear layer, see <xref ref-type="disp-formula" rid="eqn-314">Eq. (314)</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-84.tif"/>
</fig>
<p><disp-formula id="eqn-313"><label>(313)</label><mml:math id="mml-eqn-313" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AC;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:msub><mml:mrow><mml:mrow><mml:mi 
mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4B1;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Multi-head attention combines the individual &#x201C;heads&#x201D; <inline-formula id="ieqn-1725"><mml:math id="mml-ieqn-1725"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> through concatenation (along the second dimension) and subsequent projection by means of <inline-formula id="ieqn-1726"><mml:math id="mml-ieqn-1726"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>O</mml:mi></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:msub><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-314"><label>(314)</label><mml:math id="mml-eqn-314" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x0210B;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">h</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AC;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4B1;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>O</mml:mi></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:msub><mml:mrow><mml:mrow><mml:mi 
mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1727"><mml:math id="mml-ieqn-1727"><mml:mi mathvariant="bold-italic">h</mml:mi></mml:math></inline-formula> denotes the number of heads, i.e., individual scaled dot-product attention functions used in parallel. Note that the output of multi-head attention <inline-formula id="ieqn-1728"><mml:math id="mml-ieqn-1728"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x0210B;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula> has the same dimensions as the input queries <inline-formula id="ieqn-1729"><mml:math id="mml-ieqn-1729"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AC;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-1730"><mml:math id="mml-ieqn-1730"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x0210B;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AC;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
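<p>The projection, per-head attention, concatenation and output projection of multi-head attention can be sketched as follows. This is only an illustrative sketch in plain Python under stated assumptions: the function and variable names are hypothetical, and the weights are hand-picked rather than learned (each of the two heads merely selects two of the four input components, and the output projection is the identity), which makes the result easy to check by hand.</p>
<preformat>
```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    shift = max(row)
    exps = [math.exp(s - shift) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(Q, K, V):
    """Scaled dot-product attention with the softmax taken over each row."""
    scale = math.sqrt(len(K[0]))
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head_attention(Qc, Kc, Vc, W_Q, W_K, W_V, W_O):
    """Project the inputs per head, attend, concatenate, project again."""
    heads = [
        attention(matmul(Qc, wq), matmul(Kc, wk), matmul(Vc, wv))
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    # Concatenate the heads along the second dimension: k x (h * d_V).
    concat = [sum((head[i] for head in heads), []) for i in range(len(Qc))]
    return matmul(concat, W_O)  # k x d


# Hypothetical weights for h = 2 heads, d = 4, d_k = d_V = 2: each head
# selects two of the four input components; W_O is the identity matrix.
SELECT_FIRST = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
SELECT_LAST = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W_Q = W_K = W_V = [SELECT_FIRST, SELECT_LAST]
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# The same sequence of n = 3 items serves as queries, keys and values.
X = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0], [2.0, 0.0, 1.0, -1.0]]
H = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)
```
</preformat>
<p>With these particular weights, the output has the same shape as the queries (3 by 4), and a sequence consisting of a single item is mapped onto itself, because the softmax of a single score equals one.</p>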
<p>To understand why the projection is essential in the Transformer architecture, we shift our attention (no pun intended) to the encoder structure illustrated in Figure <xref ref-type="fig" rid="fig-85">85</xref>. The encoder is a stack of <inline-formula id="ieqn-1731"><mml:math id="mml-ieqn-1731"><mml:mi>N</mml:mi></mml:math></inline-formula> identical layers, each of which is composed of two sub-layers. To stack the layers without further projection, all inputs and outputs of the encoder layers share the same dimension. The same holds true for each of the sub-layers, which are also designed to preserve the input dimensions. The first sub-layer is a multi-head <italic>self-attention</italic> function, in which the input sequence attends to itself. The concept of self-attention is based on the idea of relating different items of a single sequence to generate a representation of the input, i.e., <italic>&#x201C;each position in the encoder can attend to all positions in the previous layer of the encoder.&#x201D;</italic></p>
<fig id="fig-85">
<label>Figure 85</label>
<caption><title><italic>Transformer architecture</italic> (Section <xref ref-type="sec" rid="s7_4_3">7.4.3</xref>). The Transformer is a sequence-to-sequence model without recurrent connections. Encoder and decoder are entirely built upon scaled dot-product attention. Items of source and target sequences are numerically represented as vectors, i.e., embeddings. Positional encodings furnish embeddings with information on their positions within the respective sequences. The encoder stack comprises <inline-formula id="ieqn-758"><mml:math id="mml-ieqn-758"><mml:mi>N</mml:mi></mml:math></inline-formula> layers, each of which is composed of two sub-layers: the first sub-layer is a multi-head attention function used as <italic>self-attention</italic>, by which relations among the items of one and the same sequence are learned. The second sub-layer is a <italic>position-wise</italic> fully-connected network. The decoder stack is also built from <inline-formula id="ieqn-759"><mml:math id="mml-ieqn-759"><mml:mi>N</mml:mi></mml:math></inline-formula> layers, each consisting of three sub-layers. The first and third sub-layers are identical to the sub-layers of the encoder except for masking in the self-attention function. The second sub-layer is a multi-head attention function using the encoder&#x2019;s output as keys and values to relate items of the source and target sequences. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-85.tif"/>
</fig>
<p>In the context of multi-head attention, self-attention implies that one and the same sequence simultaneously serves as queries, keys and values. Let <inline-formula id="ieqn-1732"><mml:math id="mml-ieqn-1732"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> denote the input to the self-attention (sub-)layer and <inline-formula id="ieqn-1733"><mml:math id="mml-ieqn-1733"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x0210B;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> the corresponding output, where <inline-formula id="ieqn-1734"><mml:math id="mml-ieqn-1734"><mml:mi>n</mml:mi></mml:math></inline-formula> is the length of the sequence and <inline-formula id="ieqn-1735"><mml:math id="mml-ieqn-1735"><mml:mi>d</mml:mi></mml:math></inline-formula> is the dimension of single items (<italic>&#x201C;positions&#x201D;</italic>). Self-attention can then be expressed in terms of the multi-head attention function as</p>
<p><disp-formula id="eqn-315"><label>(315)</label><mml:math id="mml-eqn-315" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x0210B;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">s</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">l</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">h</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The authors of [<xref ref-type="bibr" rid="ref-31">31</xref>] introduced a residual connection around the self-attention (sub-)layer, which, in view of the same dimensions of heads and queries, reduces to a simple addition.</p>
<p>To prevent values from growing upon summation, the residual connection is followed by <italic>layer normalization</italic> as proposed in [<xref ref-type="bibr" rid="ref-225">225</xref>], which scales the input to zero mean and unit variance:</p>
<p><disp-formula id="eqn-316"><label>(316)</label><mml:math id="mml-eqn-316" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E3;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>&#x03C3;</mml:mi></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E3;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">I</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1736"><mml:math id="mml-ieqn-1736"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-1737"><mml:math id="mml-ieqn-1737"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> denote the mean value and the standard deviation of the input components; <inline-formula id="ieqn-1738"><mml:math id="mml-ieqn-1738"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">I</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula> is a matrix of ones with the same dimension as the input <inline-formula id="ieqn-1739"><mml:math id="mml-ieqn-1739"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E3;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>. The normalized output of the encoder&#x2019;s first sub-layer therefore follows from the sum of the input <inline-formula id="ieqn-1740"><mml:math id="mml-ieqn-1740"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula> to the self-attention function and its output <inline-formula id="ieqn-1741"><mml:math id="mml-ieqn-1741"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4D7;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula> as</p>
<p><disp-formula id="eqn-317"><label>(317)</label><mml:math id="mml-eqn-317" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4D7;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
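<p>The residual connection followed by normalization can be sketched as below. This is a minimal sketch in plain Python resting on several assumptions of ours: the mean and the standard deviation are computed separately for each position (row) over its components, the learnable gain and bias of common layer-normalization implementations are omitted, and a small constant eps (hypothetical) guards against division by zero. The function names and the toy values are likewise illustrative only.</p>
<preformat>
```python
import math


def layer_norm(row, eps=1e-12):
    """Shift and scale one position to zero mean and unit variance."""
    mu = sum(row) / len(row)
    sigma = math.sqrt(sum((t - mu) ** 2 for t in row) / len(row))
    return [(t - mu) / (sigma + eps) for t in row]


def add_and_norm(X, H):
    """Residual connection (element-wise sum) followed by normalization."""
    summed = [[x + h for x, h in zip(x_row, h_row)] for x_row, h_row in zip(X, H)]
    return [layer_norm(row) for row in summed]


# Toy input and sub-layer output (assumed values), n = 2 positions, d = 4.
X = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0]]
H = [[0.5, 0.5, 0.5, 0.5], [1.0, -1.0, 1.0, -1.0]]
Z = add_and_norm(X, H)
```
</preformat>
<p>Because input and output of the sub-layer share the same dimensions, the residual connection requires no projection, and the result again has n rows of d components each.</p>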
<p>The output of the first sub-layer is the input to the second sub-layer within the encoder stack, i.e., a <italic>&#x201C;position-wise feed-forward network.&#x201D;</italic> Position-wise means that a fully connected feedforward network (see Section <xref ref-type="sec" rid="s4">4</xref> on &#x201C;Static, feedforward networks&#x201D;), represented in what follows by the function <inline-formula id="ieqn-1742"><mml:math id="mml-ieqn-1742"><mml:mi>&#x2131;</mml:mi></mml:math></inline-formula>, is applied to each item (i.e., <italic>&#x201C;position&#x201D;</italic>) of the input sequence <inline-formula id="ieqn-1743"><mml:math id="mml-ieqn-1743"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> <italic>&#x201C;separately and identically.&#x201D;</italic> In particular, a network with a single hidden layer and a rectified linear unit (see Section <xref ref-type="sec" rid="s5_3_2">5.3.2</xref> on &#x201C;Rectified linear function (ReLU)&#x201D;) as activation function was used in [<xref ref-type="bibr" rid="ref-31">31</xref>] to compute the vectors <inline-formula id="ieqn-1744"><mml:math id="mml-ieqn-1744"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>:<xref ref-type="fn" rid="fn227"><sup>227</sup></xref><fn id="fn227"><label>227</label><p>Eq. (<xref ref-type="disp-formula" rid="eqn-318">318</xref>) is meant to reflect the idea of <italic>position-wise</italic> computations of the second sub-layer, since single vectors <inline-formula id="ieqn-3276"><mml:math id="mml-ieqn-3276"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> are input to the feedforward network. From the computational point of view, however, all &#x201C;positions&#x201D; can be processed simultaneously using the same weights and biases as in Eq. (<xref ref-type="disp-formula" rid="eqn-318">318</xref>): <inline-formula id="ieqn-3277"><mml:math id="mml-ieqn-3277"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn mathvariant="bold">0</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:math></inline-formula> Note that the bias vectors <inline-formula id="ieqn-3278"><mml:math id="mml-ieqn-3278"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-3279"><mml:math id="mml-ieqn-3279"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> must be added for every position, i.e., they are added row-wise to the matrix-valued projections of the layer inputs by means of <inline-formula id="ieqn-3280"><mml:math id="mml-ieqn-3280"><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-3281"><mml:math id="mml-ieqn-3281"><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:math></inline-formula>, respectively.</p></fn></p>
<p><disp-formula id="eqn-318"><label>(318)</label><mml:math id="mml-eqn-318" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x1D4D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn mathvariant="bold">0</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi 
mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1745"><mml:math id="mml-ieqn-1745"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-1746"><mml:math id="mml-ieqn-1746"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> denote the connection weights and <inline-formula id="ieqn-1747"><mml:math id="mml-ieqn-1747"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1748"><mml:math id="mml-ieqn-1748"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are bias vectors. The individual outputs <inline-formula id="ieqn-1749"><mml:math id="mml-ieqn-1749"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>, which correspond to the respective items <inline-formula id="ieqn-1750"><mml:math id="mml-ieqn-1750"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> of the input sequence <inline-formula id="ieqn-1751"><mml:math id="mml-ieqn-1751"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>, form the matrix <inline-formula id="ieqn-1752"><mml:math id="mml-ieqn-1752"><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
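<p>As a minimal sketch of the position-wise computation in Eq. (<xref ref-type="disp-formula" rid="eqn-318">318</xref>), the following NumPy example applies the feedforward network once to each position separately and once to all positions simultaneously, as described in the footnote, and compares the two results. The weight shapes are an assumption chosen so that the matrix products in Eq. (<xref ref-type="disp-formula" rid="eqn-318">318</xref>) are defined; the function names, the sizes, and the random parameters are hypothetical and serve only as an illustration.</p>

```python
import numpy as np


def ffn_position(z_i, W1, b1, W2, b2):
    """Network applied to a single position: y_i = W2 max(0, W1 z_i + b1) + b2."""
    return W2 @ np.maximum(0.0, W1 @ z_i + b1) + b2


def ffn_all_positions(Z, W1, b1, W2, b2):
    """Same network applied to all rows of Z at once; biases broadcast over rows."""
    return np.maximum(0.0, Z @ W1.T + b1) @ W2.T + b2


rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32             # example sizes (assumed)
Z = rng.normal(size=(n, d))       # stand-in for the normalized first sub-layer output
W1 = rng.normal(size=(d_ff, d))   # hidden-layer weights
b1 = rng.normal(size=d_ff)        # hidden-layer bias
W2 = rng.normal(size=(d, d_ff))   # output-layer weights
b2 = rng.normal(size=d)           # output-layer bias

# Separate evaluation per position, stacked into a matrix of shape (n, d)
Y_rows = np.stack([ffn_position(z_i, W1, b1, W2, b2) for z_i in Z])
# Simultaneous evaluation of all positions with the same weights and biases
Y = ffn_all_positions(Z, W1, b1, W2, b2)
print(Y.shape, np.allclose(Y, Y_rows))
```

<p>Both evaluations agree up to round-off, since the same weights and biases act on every row of the input.</p>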
<p>As for the first sub-layer, the authors of [<xref ref-type="bibr" rid="ref-31">31</xref>] introduced a residual connection followed by layer normalization around the feedforward network. The output of the encoder&#x2019;s second sub-layer, which, at the same time, is the output of the encoder layer, is given by</p>
<p><disp-formula id="eqn-319"><label>(319)</label><mml:math id="mml-eqn-319" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x0190;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Within the transformer architecture (Figure <xref ref-type="fig" rid="fig-85">85</xref>), the encoder is composed of <inline-formula id="ieqn-1753"><mml:math id="mml-ieqn-1753"><mml:mi>N</mml:mi></mml:math></inline-formula> encoder layers, which are &#x201C;stacked&#x201D;, i.e., the output of the <inline-formula id="ieqn-1754"><mml:math id="mml-ieqn-1754"><mml:mi>&#x2113;</mml:mi></mml:math></inline-formula>-th layer is the input to the subsequent layer, <inline-formula id="ieqn-1755"><mml:math id="mml-ieqn-1755"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x0190;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:math></inline-formula>. Let <inline-formula id="ieqn-1756"><mml:math id="mml-ieqn-1756"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x2130;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:math></inline-formula> denote the <inline-formula id="ieqn-1757"><mml:math id="mml-ieqn-1757"><mml:mi>&#x2113;</mml:mi></mml:math></inline-formula>-th encoder layer, which is composed of the layer-specific self-attention function <inline-formula id="ieqn-1758"><mml:math id="mml-ieqn-1758"><mml:msubsup><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">s</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">l</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, taking <inline-formula id="ieqn-1759"><mml:math id="mml-ieqn-1759"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> as input, and the layer-specific feedforward network <inline-formula id="ieqn-1760"><mml:math id="mml-ieqn-1760"><mml:msup><mml:mi>&#x2131;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. The layer&#x2019;s output <inline-formula id="ieqn-1761"><mml:math id="mml-ieqn-1761"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x0190;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:math></inline-formula> is then computed as follows:</p>
<p><disp-formula id="eqn-320"><label>(320)</label><mml:math id="mml-eqn-320" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x0190;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x2130;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>&#x2131;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mrow><mml:mi mathvariant="normal">s</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">l</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mrow><mml:mi mathvariant="normal">s</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">l</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>or, alternatively, using a step-wise representation,</p>
<p><disp-formula id="eqn-321"><label>(321)</label><mml:math id="mml-eqn-321" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4D7;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mrow><mml:mi mathvariant="normal">s</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">l</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-322"><label>(322)</label><mml:math id="mml-eqn-322" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D7C0;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4D7;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-323"><label>(323)</label><mml:math id="mml-eqn-323" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x2131;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-324"><label>(324)</label><mml:math id="mml-eqn-324" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x0190;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Note that the inputs and outputs of all components of an encoder layer share the same dimensions, which allows several layers to be stacked without additional projections in between.</p>
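<p>The step-wise representation in Eqs. (<xref ref-type="disp-formula" rid="eqn-321">321</xref>)-(<xref ref-type="disp-formula" rid="eqn-324">324</xref>) and the stacking of layers can be sketched in a few lines of NumPy. The sketch rests on several assumptions that are not part of the formulation above: a single-head scaled dot-product attention with hypothetical projection matrices <monospace>Wq</monospace>, <monospace>Wk</monospace>, <monospace>Wv</monospace> stands in for the multi-head self-attention function, the normalization uses scalar statistics over all components, and all parameters are random placeholders rather than trained values.</p>

```python
import numpy as np


def softmax(s):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, p):
    """Single-head scaled dot-product self-attention (assumed stand-in)."""
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v


def layer_norm(t, eps=1e-12):
    """Normalize all components to zero mean and unit standard deviation."""
    return (t - t.mean()) / (t.std() + eps)


def feed_forward(z, p):
    """Position-wise ReLU network applied to all rows of z at once."""
    return np.maximum(0.0, z @ p["W1"].T + p["b1"]) @ p["W2"].T + p["b2"]


def encoder_layer(x, p):
    h = self_attention(x, p)   # Eq. (321)
    z = layer_norm(x + h)      # Eq. (322)
    y = feed_forward(z, p)     # Eq. (323)
    return layer_norm(z + y)   # Eq. (324)


def encoder(x, layers):
    """Stack of layers: the output of one layer is the input to the next."""
    for p in layers:
        x = encoder_layer(x, p)
    return x


def init_params(rng, d, d_ff):
    """Random placeholder parameters (hypothetical initialization)."""
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d),
              "W1": (d_ff, d), "b1": (d_ff,), "W2": (d, d_ff), "b2": (d,)}
    return {name: 0.1 * rng.normal(size=shape) for name, shape in shapes.items()}


rng = np.random.default_rng(0)
n, d, d_ff, N = 6, 8, 32, 3     # example sizes (assumed)
layers = [init_params(rng, d, d_ff) for _ in range(N)]
out = encoder(rng.normal(size=(n, d)), layers)
print(out.shape)
```

<p>Because every intermediate quantity keeps the dimensions of the layer input, the loop in <monospace>encoder</monospace> needs no projection between consecutive layers.</p>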
<p>The decoder&#x2019;s structure within the transformer architecture is similar to that of the encoder (see the right part of Figure <xref ref-type="fig" rid="fig-85">85</xref>). Like the encoder, it is composed of <inline-formula id="ieqn-1762"><mml:math id="mml-ieqn-1762"><mml:mi>N</mml:mi></mml:math></inline-formula> identical layers, each of which combines three sub-layers (as opposed to two sub-layers in the encoder). In addition to the self-attention sub-layer (first sub-layer) and the fully-connected position-wise feed-forward network (third sub-layer), the decoder inserts a second sub-layer in between that <italic>&#x201C;performs multi-head attention over the output of the encoder stack.&#x201D;</italic> Attending to the outputs of the encoder enables the decoder to relate items of the source sequence to items of the target sequence. Just as in the encoder, residual connections are introduced around each of the decoder&#x2019;s sub-layers.</p>
<p>Let <inline-formula id="ieqn-1763"><mml:math id="mml-ieqn-1763"><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> denote the input to the <inline-formula id="ieqn-1764"><mml:math id="mml-ieqn-1764"><mml:mi>&#x2113;</mml:mi></mml:math></inline-formula>-th decoder layer, which, for <inline-formula id="ieqn-1765"><mml:math id="mml-ieqn-1765"><mml:mi>&#x2113;</mml:mi><mml:mo>&gt;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, is the output <inline-formula id="ieqn-1766"><mml:math id="mml-ieqn-1766"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4D3;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> of the previous layer. 
The output <inline-formula id="ieqn-1767"><mml:math id="mml-ieqn-1767"><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4D3;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:math></inline-formula> of the <inline-formula id="ieqn-1768"><mml:math id="mml-ieqn-1768"><mml:mi>&#x2113;</mml:mi></mml:math></inline-formula>-th decoder layer is then obtained through the following computations. The first sub-layer performs (multi-headed) self-attention over items of the output sequence, i.e., it establishes a relation among them:</p>
<p><disp-formula id="eqn-325"><label>(325)</label><mml:math id="mml-eqn-325" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4D7;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi><mml:mrow><mml:mi mathvariant="normal">s</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">l</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-326"><label>(326)</label><mml:math id="mml-eqn-326" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4D7;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The second sub-layer relates items of the source and the target sequences by means of multi-head attention:</p>
<p><disp-formula id="eqn-327"><label>(327)</label><mml:math id="mml-eqn-327" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D49C;</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x0190;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x0190;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo 
stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-328"><label>(328)</label><mml:math id="mml-eqn-328" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The third sub-layer is a fully-connected feed-forward network (with a single hidden layer) that is applied position-wise, i.e., to each element of the sequence separately:</p>
<p><disp-formula id="eqn-329"><label>(329)</label><mml:math id="mml-eqn-329" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x2131;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-330"><label>(330)</label><mml:math id="mml-eqn-330" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4D3;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x1D4A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4E9;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4E8;</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The output of the last (<inline-formula id="ieqn-1769"><mml:math id="mml-ieqn-1769"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mi>N</mml:mi></mml:math></inline-formula>) decoder-layer within the decoder-stack is projected onto a vector that has the dimension of the &#x201C;vocabulary&#x201D;, i.e., the set of all feasible output items. Taking the <inline-formula id="ieqn-1770"><mml:math id="mml-ieqn-1770"><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> over the components of this vector yields, for each element of the vocabulary, the probability of being the next item of the output sequence (see Section <xref ref-type="sec" rid="s5_1_3">5.1.3</xref>).</p>
<p>A Transformer-model is trained on the complete source and target sequences, which are input to the encoder and the decoder, respectively. The target sequence is shifted right by one position, such that a special token indicating the start of a new sequence can be placed at the beginning, see <xref ref-type="fig" rid="fig-85">Fig. 85</xref>. To prevent the decoder from attending to future items of the output sequence, the (multi-headed) self-attention sub-layer needs to be <italic>&#x201C;masked&#x201D;</italic>. The masking is realized by setting to <inline-formula id="ieqn-1772"><mml:math id="mml-ieqn-1772"><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula> those inputs to the <inline-formula id="ieqn-1771"><mml:math id="mml-ieqn-1771"><mml:mrow><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> function of the scaled dot-product attention, see Eq. (<xref ref-type="disp-formula" rid="eqn-311">311</xref>), for which a query vector is aligned with keys that correspond to items in the output sequence beyond the query&#x2019;s position. As the loss function, the negative log-likelihood is typically used; see Section <xref ref-type="sec" rid="s5_1_2">5.1.2</xref> on maximum likelihood (probability cost) and Section <xref ref-type="sec" rid="s5_1_3">5.1.3</xref> on classification loss functions.</p>
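<p>The masking described above can be sketched numerically as follows. This is a minimal single-head illustration, assuming the standard scaled dot-product form (softmax of the query-key products divided by the square root of the key dimension); the names <monospace>softmax</monospace> and <monospace>masked_self_attention</monospace> and the random test data are hypothetical and not taken from [<xref ref-type="bibr" rid="ref-31">31</xref>].</p>
<code language="python">
```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax; exp(-inf) evaluates to 0."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def masked_self_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V have shape (n, d_k). Row i of the result only depends on
    rows 0..i of K and V, i.e., on the current and earlier positions.
    """
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                      # alignment scores, (n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # mask future positions
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


# Hypothetical data: a sequence of n = 4 items with d_k = 8 components each.
rng = np.random.default_rng(0)
n, d_k = 4, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))
out, weights = masked_self_attention(Q, K, V)
```
</code>
<p>In this sketch, the attention weights form a lower-triangular matrix whose rows sum to one, so the first item can only attend to itself.</p>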
<p>As the Transformer architecture does not have recurrent connections, <italic>positional encodings</italic> were added to the inputs of both encoder and decoder [<xref ref-type="bibr" rid="ref-31">31</xref>], see Figure <xref ref-type="fig" rid="fig-85">85</xref>. The positional encodings supply the individual items of the inputs to the encoder and the decoder with information about their positions within the respective sequences. For this purpose, the authors of [<xref ref-type="bibr" rid="ref-31">31</xref>] proposed to add vector-valued positional encodings <inline-formula id="ieqn-1773"><mml:math id="mml-ieqn-1773"><mml:msub><mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, which have the same dimension <inline-formula id="ieqn-1774"><mml:math id="mml-ieqn-1774"><mml:mi>d</mml:mi></mml:math></inline-formula> as the input embedding of an item, i.e., its numerical representation as a vector. They used sine and cosine functions of an item&#x2019;s position (<inline-formula id="ieqn-1775"><mml:math id="mml-ieqn-1775"><mml:mi>i</mml:mi></mml:math></inline-formula>) within the respective sequence, whose frequency decreases with the component index (<inline-formula id="ieqn-1776"><mml:math id="mml-ieqn-1776"><mml:mi>j</mml:mi></mml:math></inline-formula>):</p>
<p><disp-formula id="eqn-331"><label>(331)</label><mml:math id="mml-eqn-331" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mi>i</mml:mi><mml:msup><mml:mn>10000</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:mi>j</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mfrac><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
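<p>A small numerical sketch of such sinusoidal encodings is given next. It assumes that the angle is the position <italic>i</italic> divided by 10000 raised to the power 2<italic>j</italic>/<italic>d</italic>, with <italic>j</italic> running over the <italic>d</italic>/2 sine-cosine pairs, so that the frequency decreases with the component index as stated in the text; positions are counted from 1 here, which is an assumption, and the function name <monospace>positional_encodings</monospace> is hypothetical.</p>
<code language="python">
```python
import numpy as np


def positional_encodings(n, d, base=10000.0):
    """Return an (n, d) array whose row i-1 is the encoding p_i of position i.

    Assumption: theta_ij = i / base**(2j/d) for j = 1..d/2, with the sine in
    the (1-based) component 2j and the cosine in the component 2j-1.
    """
    if d % 2 != 0:
        raise ValueError("d must be even to form sine-cosine pairs")
    i = np.arange(1, n + 1)[:, None]        # positions, shape (n, 1)
    j = np.arange(1, d // 2 + 1)[None, :]   # pair index, shape (1, d/2)
    theta = i / base ** (2.0 * j / d)       # angles, shape (n, d/2)
    P = np.empty((n, d))
    P[:, 1::2] = np.sin(theta)              # 1-based components 2, 4, ..., d
    P[:, 0::2] = np.cos(theta)              # 1-based components 1, 3, ..., d-1
    return P


# Encodings for a sequence of 50 items with embedding dimension d = 16.
P = positional_encodings(50, 16)
```
</code>
<p>Each row of <monospace>P</monospace> would be added to the embedding of the item at the corresponding position.</p>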
<p>The trained Transformer-model produces one item of the output sequence at a time. Given the input sequence and all outputs already generated in previous steps, the Transformer predicts probabilities for the next item in the output sequence. Models of this kind are referred to as &#x201C;auto-regressive&#x201D;.</p>
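<p>The auto-regressive generation loop can be sketched as follows. The function <monospace>toy_next_item_probabilities</monospace> is a hypothetical stand-in for a trained Transformer (it merely copies the source sequence), and the start and end tokens, the greedy selection of the most probable item, and all names are assumptions made for illustration only.</p>
<code language="python">
```python
import numpy as np

START, END = 0, 1  # hypothetical special tokens (start and end of sequence)


def toy_next_item_probabilities(source, generated, vocab_size):
    """Stand-in for a trained model: copies the source item by item, then ends.

    Returns a probability vector over the vocabulary for the next item,
    given the source sequence and all items generated so far.
    """
    step = len(generated) - 1            # items produced so far (without START)
    remaining = source[step:]
    target = remaining[0] if remaining else END
    probs = np.full(vocab_size, 0.1 / (vocab_size - 1))
    probs[target] = 0.9                  # probabilities sum to one
    return probs


def greedy_decode(source, next_item_probabilities, vocab_size, max_len=20):
    """Generate one item at a time, feeding previous outputs back in."""
    generated = [START]
    for _ in range(max_len):
        probs = next_item_probabilities(source, generated, vocab_size)
        item = int(np.argmax(probs))     # greedy choice of the next item
        generated.append(item)
        if item == END:
            break
    return generated


output = greedy_decode([5, 3, 7], toy_next_item_probabilities, vocab_size=10)
```
</code>
<p>In practice, the greedy choice is often replaced by sampling or a beam search over the predicted probabilities; the loop structure stays the same.</p>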
<p>The authors of [<xref ref-type="bibr" rid="ref-31">31</xref>] varied parameters of the Transformer model to study the importance of individual components. Their &#x201C;base model&#x201D; used <inline-formula id="ieqn-1777"><mml:math id="mml-ieqn-1777"><mml:mi>N</mml:mi><mml:mo>=</mml:mo><mml:mn>6</mml:mn></mml:math></inline-formula> encoder and decoder layers. Inputs and outputs of all sub-layers are sequences of vectors, which have a dimension of <inline-formula id="ieqn-1778"><mml:math id="mml-ieqn-1778"><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mn>512</mml:mn></mml:math></inline-formula>. All multi-head attention functions of the Transformer, see <xref ref-type="disp-formula" rid="eqn-321">Eq.(321)</xref>, <xref ref-type="disp-formula" rid="eqn-325">Eq.(325)</xref> and <xref ref-type="disp-formula" rid="eqn-327">Eq.(327)</xref>, have <inline-formula id="ieqn-1779"><mml:math id="mml-ieqn-1779"><mml:mi>h</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn>8</mml:mn></mml:mrow></mml:math></inline-formula> heads each. Before the alignment scores are computed by each head, the dimensions of queries, keys and values were reduced to <inline-formula id="ieqn-1780"><mml:math id="mml-ieqn-1780"><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mn>64</mml:mn></mml:mrow></mml:math></inline-formula>. The hidden layer of the fully-connected feedforward network, see <xref ref-type="disp-formula" rid="eqn-318">Eq.(318)</xref>, was chosen as <inline-formula id="ieqn-1781"><mml:math id="mml-ieqn-1781"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mn>2048</mml:mn></mml:mrow></mml:math></inline-formula> neurons wide. With this set of (hyper-)parameters, the Transformer model features a total of 65 &#x00D7; 10<sup>6</sup> parameters. 
The training cost (in terms of floating-point operations, FLOPs) of the attention-based Transformer was shown to be (at least) two orders of magnitude smaller than that of comparable models at that time, while the performance was maintained. A refined (&#x201C;big&#x201D;) model was able to outperform all previous approaches (based on RNNs or convolutional neural networks) in English-to-French and English-to-German translation tasks.</p>
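<p>The hyper-parameters of the base model quoted above can be cross-checked with a rough weight count. The sketch below ignores biases and normalization parameters, counts four square projection matrices per multi-head attention block, and assumes a single shared embedding matrix with a hypothetical vocabulary size of 37000, which is not stated in the text; all variable names are illustrative.</p>
<code language="python">
```python
# Hyper-parameters of the base model as quoted in the text.
d, h, d_ff, N = 512, 8, 2048, 6

# ASSUMPTION (not from the text): one shared embedding with 37000 entries.
vocab_size = 37000

head_dim = d // h                    # dimension per head, d_k = d_v

attention_weights = 4 * d * d        # query, key, value and output projections
ffn_weights = 2 * d * d_ff           # two weight matrices of the feed-forward net

# Encoder layer: one attention block + feed-forward network.
encoder_weights = N * (attention_weights + ffn_weights)
# Decoder layer: two attention blocks + feed-forward network.
decoder_weights = N * (2 * attention_weights + ffn_weights)
embedding_weights = vocab_size * d

total_weights = encoder_weights + decoder_weights + embedding_weights
print(f"d_k = {head_dim}, approx. {total_weights / 1e6:.1f} million weights")
```
</code>
<p>Under these assumptions, the head dimension equals 64 and the count comes to roughly 63 &#x00D7; 10<sup>6</sup> weights, i.e., the same order as the 65 &#x00D7; 10<sup>6</sup> parameters quoted above; the difference depends on details not covered by this sketch.</p>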
<p>As of 2022, the <italic>Generative Pre-trained Transformer 3</italic> (GPT-3) model [<xref ref-type="bibr" rid="ref-226">226</xref>], which is based on the Transformer architecture, is among the most powerful language models. GPT-3 is an autoregressive model that produces text from a given (initial) text prompt, whereby it can deal with different tasks such as translation, question-answering, cloze-tests<xref ref-type="fn" rid="fn228"><sup>228</sup></xref><fn id="fn228"><label>228</label><p>A cloze is &#x201C;a form of written examination in which candidates are required to provide words that have been omitted from sentences, thereby demonstrating their knowledge and comprehension of the text&#x201D;, see <ext-link ext-link-type="uri" xlink:href="https://en.wiktionary.org/w/index.php?title=cloze&amp;oldid=65140547">Wiktionary version 11:23, 2 January 2022</ext-link>.</p></fn> and word-unscrambling. The impressive capabilities of GPT-3 are enabled by its huge capacity of 175 <italic>billion</italic> parameters, which is 10 times more than preceding language models.</p>
<statement id="st7_7"><title>Remark 7.7.</title>
<p><italic>Attention mechanism, kernel machines, physics-informed neural networks</italic> (PINNs). In [<xref ref-type="bibr" rid="ref-227">227</xref>], a new attention architecture (mechanism) was proposed by using kernel machines discussed in Section <xref ref-type="sec" rid="s8">8</xref>, whereas in [<xref ref-type="bibr" rid="ref-228">228</xref>], the gated recurrent units (GRU, Section <xref ref-type="sec" rid="s7_3">7.3</xref>) and the attention mechanism (Section <xref ref-type="sec" rid="s7_4_1">7.4.1</xref>) were used in conjunction with Physics-Informed Neural Networks (PINNs, Section <xref ref-type="sec" rid="s9_5">9.5</xref>) to solve hyperbolic problems with shock waves; see also Remark <xref ref-type="statement" rid="st9_5">9.5</xref> and Remark <xref ref-type="statement" rid="st11_11">11.11</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement></sec></sec></sec>
<sec id="s8"><label>8</label>
<title>Kernel machines (methods, learning)</title>
<p>Researchers have observed that as the number of parameters increased beyond the interpolation threshold, or as the number of hidden units in a layer (i.e., layer width) increased, the test error decreased, i.e., such networks generalized well; see Figures <xref ref-type="fig" rid="fig-60">60</xref> and <xref ref-type="fig" rid="fig-61">61</xref>. So, as a first step toward understanding why deep-learning networks work (Section <xref ref-type="sec" rid="s14_2">14.2</xref> on &#x201C;Lack of understanding&#x201D;), it is natural to study the limiting case of infinite layer width first, since it would be relatively easier than the case of finite layer width [<xref ref-type="bibr" rid="ref-229">229</xref>]; see Figure <xref ref-type="fig" rid="fig-148">148</xref>.</p>
<p>In doing so, a connection between networks with infinite width and kernel machines (or kernel methods) was revealed [<xref ref-type="bibr" rid="ref-230">230</xref>] [<xref ref-type="bibr" rid="ref-231">231</xref>] [<xref ref-type="bibr" rid="ref-232">232</xref>]. See also the connection between kernel methods and Support Vector Machines (SVM) in Footnote <xref ref-type="fn" rid="fn31">31</xref>.</p>
<p>&#x201C;A neural network is a little bit like a Rube Goldberg machine. You don&#x2019;t know which part of it is really important. ... reducing [them] to kernel methods&#x2013;because kernel methods don&#x2019;t have all this complexity&#x2013;somehow allows us to isolate the engine of what&#x2019;s going on&#x201D; [<xref ref-type="bibr" rid="ref-230">230</xref>].</p>
<p>Quanta Magazine described the discovery of this connection as the 2021 breakthrough in computer science [<xref ref-type="bibr" rid="ref-233">233</xref>].</p>
<p>Covariance functions, or covariance matrices, are kernels in Gaussian processes (Section <xref ref-type="sec" rid="s8_3">8.3</xref>), an important class of methods in machine learning [<xref ref-type="bibr" rid="ref-234">234</xref>] [<xref ref-type="bibr" rid="ref-130">130</xref>]. A kernel method in terms of the time variable was discussed in Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref> in connection with the continuous temporal summation in neuroscience. Our aim here is only to provide first-time learners with background material on kernel methods in terms of space variables (specifically the &#x201C;Setup&#x201D; in [<xref ref-type="bibr" rid="ref-235">235</xref>]) in preparation for reading the more advanced references mentioned in this section, such as [<xref ref-type="bibr" rid="ref-235">235</xref>] [<xref ref-type="bibr" rid="ref-232">232</xref>] [<xref ref-type="bibr" rid="ref-236">236</xref>].</p>
<sec id="s8_1"><label>8.1</label>
<title>Reproducing kernel: General theory</title>
<p>A kernel <inline-formula id="ieqn-1782"><mml:math id="mml-ieqn-1782"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is called <italic>reproducing</italic> if its scalar (<inline-formula id="ieqn-1783"><mml:math id="mml-ieqn-1783"><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>) product with a function <inline-formula id="ieqn-1784"><mml:math id="mml-ieqn-1784"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> reproduces the same function <inline-formula id="ieqn-1785"><mml:math id="mml-ieqn-1785"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> itself [<xref ref-type="bibr" rid="ref-237">237</xref>]:</p>
<p><disp-formula id="eqn-332"><label>(332)</label><mml:math id="mml-eqn-332" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>y</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x222B;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the subscript <inline-formula id="ieqn-1786"><mml:math id="mml-ieqn-1786"><mml:mi>y</mml:mi></mml:math></inline-formula> in the scalar product <inline-formula id="ieqn-1787"><mml:math id="mml-ieqn-1787"><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>y</mml:mi></mml:msub></mml:math></inline-formula> indicates the integration variable.</p>
<p>Let <inline-formula id="ieqn-1788"><mml:math id="mml-ieqn-1788"><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> be a set of linearly independent basis functions. A function <inline-formula id="ieqn-1789"><mml:math id="mml-ieqn-1789"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> can be expressed in this basis as follows:</p>
<p><disp-formula id="eqn-333"><label>(333)</label><mml:math id="mml-eqn-333" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The scalar product of two functions <inline-formula id="ieqn-1790"><mml:math id="mml-ieqn-1790"><mml:mi>f</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-1791"><mml:math id="mml-ieqn-1791"><mml:mi>g</mml:mi></mml:math></inline-formula> is given by:</p>
<p><disp-formula id="eqn-334"><label>(334)</label><mml:math id="mml-eqn-334" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B7;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>&#x03B7;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>&#x03B7;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where</p>
<p><disp-formula id="eqn-335"><label>(335)</label><mml:math id="mml-eqn-335" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mo>&gt;</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>is the Gram matrix,<xref ref-type="fn" rid="fn229"><sup>229</sup></xref><fn id="fn229"><label>229</label><p>The stiffness matrix in the displacement finite element method is a Gram matrix.</p></fn> which is strictly positive definite, with its inverse (also strictly positive definite) denoted by</p>
<p><disp-formula id="eqn-336"><label>(336)</label><mml:math id="mml-eqn-336" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mo>&gt;</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
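<p>The quantities introduced so far can be checked numerically on a small example. The sketch below assumes the monomial basis 1, <italic>x</italic>, <italic>x</italic><sup>2</sup> on the interval [0, 1] with the <italic>L</italic><sub>2</sub> scalar product evaluated by Gauss-Legendre quadrature; the basis, the interval, the coefficient values and all names are hypothetical choices for illustration. It verifies that the scalar product of two expanded functions equals the double sum of coefficient products weighted by the Gram matrix, and that the kernel built from the inverse Gram matrix, i.e., the transposed basis vector at <italic>x</italic> times the inverse Gram matrix times the basis vector at <italic>y</italic>, reproduces a function in the span of the basis.</p>
<code language="python">
```python
import numpy as np


def basis(x):
    """Monomial basis [1, x, x^2] evaluated at points x; shape (3, len(x))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.vstack([x ** k for k in range(3)])


# Gauss-Legendre quadrature, mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(10)
y = 0.5 * (nodes + 1.0)
w = 0.5 * weights

Phi_y = basis(y)                     # basis values at quadrature points, (3, 10)
Gamma = (Phi_y * w) @ Phi_y.T        # Gram matrix of scalar products of basis functions
Delta = np.linalg.inv(Gamma)         # inverse Gram matrix

zeta = np.array([1.0, -2.0, 0.5])    # hypothetical coefficients of f
eta = np.array([0.3, 0.0, 4.0])      # hypothetical coefficients of g
f_y = zeta @ Phi_y                   # f at the quadrature points
g_y = eta @ Phi_y                    # g at the quadrature points

inner_quadrature = np.sum(w * f_y * g_y)   # scalar product by direct integration
inner_gram = zeta @ Gamma @ eta            # same scalar product via the Gram matrix


def kernel(x, yy):
    """Kernel values Phi(x)^T Delta Phi(yy); shape (len(x), len(yy))."""
    return basis(x).T @ Delta @ basis(yy)


# Reproducing property: integrate f(y) K(x, y) over y and compare with f(x).
x_test = np.array([0.1, 0.5, 0.9])
f_reproduced = kernel(x_test, y) @ (w * f_y)
f_direct = zeta @ basis(x_test)
```
</code>
<p>For this particular basis, the Gram matrix is the 3 &#x00D7; 3 Hilbert matrix, which is symmetric and positive definite but becomes ill-conditioned quickly as more monomials are added.</p>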
<p>Then a reproducing kernel can be written as [<xref ref-type="bibr" rid="ref-237">237</xref>]<xref ref-type="fn" rid="fn230"><sup>230</sup></xref><fn id="fn230"><label>230</label><p>See Eq. (<xref ref-type="disp-formula" rid="eqn-6">6</xref>), p. 346, in [<xref ref-type="bibr" rid="ref-237">237</xref>].</p></fn></p>
<p><disp-formula id="eqn-337"><label>(337)</label><mml:math id="mml-eqn-337" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<statement id="st8_1"><title>Remark 8.1.</title>
<p>It is easy to verify that the function <inline-formula id="ieqn-1792"><mml:math id="mml-ieqn-1792"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-337">337</xref>) is a reproducing kernel:</p>
<p><disp-formula id="eqn-338"><label>(338)</label><mml:math id="mml-eqn-338" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>y</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mo fence="false" 
stretchy="false">&#x27E9;</mml:mo><mml:mi>y</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munder><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>y</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-339"><label>(339)</label><mml:math id="mml-eqn-339" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo 
stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>using Eqs. (<xref ref-type="disp-formula" rid="eqn-335">335</xref>) and (<xref ref-type="disp-formula" rid="eqn-336">336</xref>).&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
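<p>The reproducing property in Eqs. (<xref ref-type="disp-formula" rid="eqn-338">338</xref>)-(<xref ref-type="disp-formula" rid="eqn-339">339</xref>) can also be checked numerically. The following is a minimal sketch under assumptions that are not part of the text above: a non-orthogonal monomial basis on the unit interval, the usual integral inner product evaluated by Gauss-Legendre quadrature, and illustrative names such as <monospace>basis</monospace>, <monospace>kernel</monospace>, and <monospace>zeta</monospace>.</p>
<code language="python">
```python
import numpy as np

# Assumed setting (not from the source): monomial basis {1, y, y^2} on [0, 1]
# with the L2 inner product, evaluated by Gauss-Legendre quadrature.
n = 3
t, w = np.polynomial.legendre.leggauss(8)   # nodes and weights on [-1, 1]
y_q, w_q = 0.5 * (t + 1.0), 0.5 * w         # mapped to [0, 1]

def basis(x):
    """Rows are points, columns are phi_1, ..., phi_n (here 1, x, x^2)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.vander(x, n, increasing=True)

Phi_q = basis(y_q)                          # shape (8, n)
Gamma = Phi_q.T @ (w_q[:, None] * Phi_q)    # Gram matrix, Eq. (335)
Delta = np.linalg.inv(Gamma)                # its inverse, Eq. (336)

def kernel(x, y):
    """K(x, y) = Phi^T(x) Delta Phi(y), Eq. (337)."""
    return basis(x) @ Delta @ basis(y).T

zeta = np.array([2.0, -1.0, 0.5])           # coefficients of f in the basis

def f(x):
    return basis(x) @ zeta

x_test = np.array([0.1, 0.4, 0.9])
# Inner product of f(y) with K(y, x) over y, one value per test point x
reproduced = (w_q * f(y_q)) @ kernel(y_q, x_test)
print(np.allclose(reproduced, f(x_test)))
```
</code>
<p>With this basis the Gram matrix is the 3-by-3 Hilbert matrix, which is not diagonal, so the sketch exercises the general non-orthogonal form of the kernel rather than the orthogonal special case.</p>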
<p>From Eq. (<xref ref-type="disp-formula" rid="eqn-334">334</xref>), the norm of a function <inline-formula id="ieqn-1793"><mml:math id="mml-ieqn-1793"><mml:mi>f</mml:mi></mml:math></inline-formula> can be defined as</p>
<p><disp-formula id="eqn-340"><label>(340)</label><mml:math id="mml-eqn-340" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi>f</mml:mi><mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B6;</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mi mathvariant="bold-italic">&#x03B6;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">&#x03B6;</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">&#x03B6;</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">&#x03B6;</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mo 
stretchy="false">[</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1794"><mml:math id="mml-ieqn-1794"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the norm induced by the kernel <inline-formula id="ieqn-1795"><mml:math id="mml-ieqn-1795"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:math></inline-formula>, and where the matrix notation in Eq. (<xref ref-type="disp-formula" rid="eqn-7">7</xref>) was used.</p>
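<p>As a small worked check of Eq. (<xref ref-type="disp-formula" rid="eqn-340">340</xref>), the quadratic form in the Gram matrix can be compared with a direct evaluation of the inner product. The sketch below assumes, for illustration only, monomial basis functions on the unit interval with the integral inner product, for which the Gram matrix is the Hilbert matrix; the coefficient values are arbitrary.</p>
<code language="python">
```python
import numpy as np

# Assumed setting: phi_i(x) = x^(i-1) on [0, 1] with the L2 inner product,
# for which the Gram matrix is the n-by-n Hilbert matrix.
n = 4
idx = np.arange(n)
Gamma = 1.0 / (idx[:, None] + idx[None, :] + 1.0)

zeta = np.array([1.0, -2.0, 0.0, 3.0])      # f(x) = 1 - 2x + 3x^3
norm_sq_matrix = zeta @ Gamma @ zeta        # zeta^T Gamma zeta, Eq. (340)

# Independent check: integrate f(x)^2 exactly over [0, 1]
f_poly = np.polynomial.Polynomial(zeta)
antiderivative = (f_poly * f_poly).integ()
norm_sq_integral = antiderivative(1.0) - antiderivative(0.0)

print(np.isclose(norm_sq_matrix, norm_sq_integral))
```
</code>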
<p>When the basis functions in <inline-formula id="ieqn-1796"><mml:math id="mml-ieqn-1796"><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> are mutually orthogonal, the Gram matrix <inline-formula id="ieqn-1797"><mml:math id="mml-ieqn-1797"><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-335">335</xref>) is diagonal, and the reproducing kernel <inline-formula id="ieqn-1798"><mml:math id="mml-ieqn-1798"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-337">337</xref>) takes the simple form:</p>
<p><disp-formula id="eqn-341"><label>(341)</label><mml:math id="mml-eqn-341" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mi>k</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&gt;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1799"><mml:math id="mml-ieqn-1799"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the Kronecker delta, and where the summation convention on repeated indices, except when enclosed in parentheses, was applied. In the case of an infinite-dimensional space of functions, Eq. (<xref ref-type="disp-formula" rid="eqn-341">341</xref>), Eq. (<xref ref-type="disp-formula" rid="eqn-333">333</xref>), and Eq. (<xref ref-type="disp-formula" rid="eqn-334">334</xref>) would be written with <inline-formula id="ieqn-1800"><mml:math id="mml-ieqn-1800"><mml:mi>n</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula>:<xref ref-type="fn" rid="fn231"><sup>231</sup></xref><fn id="fn231"><label>231</label><p>Eq. (<xref ref-type="disp-formula" rid="eqn-342">342</xref>)<inline-formula id="ieqn-3282"><mml:math id="mml-ieqn-3282"><mml:msub><mml:mi></mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and Eqs. (<xref ref-type="disp-formula" rid="eqn-343">343</xref>)<inline-formula id="ieqn-3283"><mml:math id="mml-ieqn-3283"><mml:msub><mml:mi></mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> were given as Eq. (5.46), Eq. (5.47), and Eq. (5.45), respectively, in [<xref ref-type="bibr" rid="ref-238">238</xref>], pp. 168-169, where the basis functions <inline-formula id="ieqn-3284"><mml:math id="mml-ieqn-3284"><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> were the &#x201C;eigen-functions,&#x201D; and where the general case of a non-orthogonal basis presented in Eq. 
(<xref ref-type="disp-formula" rid="eqn-337">337</xref>) was not discussed. Technically, Eq. (<xref ref-type="disp-formula" rid="eqn-343">343</xref>)<inline-formula id="ieqn-3285"><mml:math id="mml-ieqn-3285"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> is accompanied by the conditions <inline-formula id="ieqn-3286"><mml:math id="mml-ieqn-3286"><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, for <inline-formula id="ieqn-3287"><mml:math id="mml-ieqn-3287"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-3288"><mml:math id="mml-ieqn-3288"><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>&lt;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula>, i.e., the sum of the squared coefficients is finite [<xref ref-type="bibr" rid="ref-238">238</xref>], p. 188.</p></fn></p>
<p><disp-formula id="eqn-342"><label>(342)</label><mml:math id="mml-eqn-342" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mi>&#x03B7;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mi>&#x03B7;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi 
mathvariant="normal">&#x0394;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03B6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03B7;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-343"><label>(343)</label><mml:math id="mml-eqn-343" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi>f</mml:mi><mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:msub><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03B6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03B6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
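<p>The orthogonal case can be illustrated with a truncated expansion. The sketch below is only an assumption-laden example: it takes the first few Legendre polynomials on the interval from -1 to 1 as the orthogonal basis, computes the Gram matrix by quadrature to confirm that it is diagonal, and compares the coefficient form of the inner product in Eq. (<xref ref-type="disp-formula" rid="eqn-342">342</xref>) with direct integration. The names <monospace>Gamma_k</monospace>, <monospace>Delta_k</monospace>, and <monospace>kernel</monospace> are illustrative.</p>
<code language="python">
```python
import numpy as np
from numpy.polynomial import legendre

# Assumed setting: the first n Legendre polynomials on [-1, 1],
# which are mutually orthogonal under the L2 inner product.
n = 5
t, w = legendre.leggauss(n + 2)             # quadrature exact up to degree 2n + 3

def basis(x):
    """Columns are P_0(x), ..., P_{n-1}(x)."""
    return legendre.legvander(np.atleast_1d(x), n - 1)

P = basis(t)
Gamma = P.T @ (w[:, None] * P)              # numerically diagonal Gram matrix
Gamma_k = np.diag(Gamma)                    # diagonal entries Gamma_k
Delta_k = 1.0 / Gamma_k                     # Delta_k = 1 / Gamma_k

rng = np.random.default_rng(0)
zeta, eta = rng.normal(size=n), rng.normal(size=n)

# Inner product from the coefficients, as in Eq. (342), truncated to n terms
inner_coeff = np.sum(zeta * eta / Delta_k)
# The same inner product by direct quadrature of f(x) g(x)
inner_quad = np.sum(w * (P @ zeta) * (P @ eta))

def kernel(x, y):
    """Single-sum kernel of Eq. (341): sum_k Delta_k phi_k(x) phi_k(y)."""
    return (basis(x) * Delta_k) @ basis(y).T

print(np.isclose(inner_coeff, inner_quad))
```
</code>
<p>Because the Gram matrix is diagonal here, inverting it reduces to taking reciprocals of its diagonal entries, which is why the double sum in the general kernel collapses to a single sum.</p>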
<p>Let <inline-formula id="ieqn-1801"><mml:math id="mml-ieqn-1801"><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> be the loss (cost) function, with <inline-formula id="ieqn-1802"><mml:math id="mml-ieqn-1802"><mml:mi>x</mml:mi></mml:math></inline-formula> being the data, <inline-formula id="ieqn-1803"><mml:math id="mml-ieqn-1803"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the predicted output, and <inline-formula id="ieqn-1804"><mml:math id="mml-ieqn-1804"><mml:mi>y</mml:mi></mml:math></inline-formula> the label. Consider the regularized minimization problem:</p>
<p><disp-formula id="eqn-344"><label>(344)</label><mml:math id="mml-eqn-344" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mo>[</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2225;</mml:mo><mml:mi>f</mml:mi><mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mn>1</mml:mn><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msubsup></mml:mrow></mml:munder><mml:mrow><mml:mo>[</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is a &#x201C;ridge&#x201D; penalty method<xref ref-type="fn" rid="fn232"><sup>232</sup></xref><fn id="fn232"><label>232</label><p>See [<xref ref-type="bibr" rid="ref-238">238</xref>], Eq. (3.41), p. 61, in Section <xref ref-type="sec" rid="s13_4_1">3.4.1</xref> on &#x201C;Ridge regression,&#x201D; which &#x201C;shrinks the regression coefficients by imposing a penalty on them.&#x201D; Here the penalty is imposed on the kernel-induced norm of <inline-formula id="ieqn-3289"><mml:math id="mml-ieqn-3289"><mml:mi>f</mml:mi></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-3290"><mml:math id="mml-ieqn-3290"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi>f</mml:mi><mml:msubsup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>.</p></fn> with <inline-formula id="ieqn-1805"><mml:math id="mml-ieqn-1805"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> being the penalty coefficient (or regularization parameter). The penalty aims to keep the squared norm of the minimizer, i.e., <inline-formula id="ieqn-1806"><mml:math id="mml-ieqn-1806"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:msubsup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>, as small as possible: the larger this norm, the more the objective function (cost <inline-formula id="ieqn-1807"><mml:math id="mml-ieqn-1807"><mml:mi>L</mml:mi></mml:math></inline-formula> plus penalty <inline-formula id="ieqn-1808"><mml:math id="mml-ieqn-1808"><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2225;</mml:mo><mml:mi>f</mml:mi><mml:msubsup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>) is penalized.<xref ref-type="fn" rid="fn233"><sup>233</sup></xref><fn id="fn233"><label>233</label><p>In classical regularization, 
the loss function (1st term) in Eq. (<xref ref-type="disp-formula" rid="eqn-344">344</xref>) is called the &#x201C;empirical risk&#x201D; and the penalty term (2nd term) the &#x201C;stabilizer&#x201D; [<xref ref-type="bibr" rid="ref-239">239</xref>].</p></fn> What is remarkable is that, even though the minimization problem is <italic>infinite</italic> dimensional, the minimizer (solution) <inline-formula id="ieqn-1809"><mml:math id="mml-ieqn-1809"><mml:msup><mml:mi>f</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula> of Eq. (<xref ref-type="disp-formula" rid="eqn-344">344</xref>) is <italic>finite</italic> dimensional [<xref ref-type="bibr" rid="ref-238">238</xref>]:</p>
<p><disp-formula id="eqn-345"><label>(345)</label><mml:math id="mml-eqn-345" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi>f</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1810"><mml:math id="mml-ieqn-1810"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a basis function, and <inline-formula id="ieqn-1811"><mml:math id="mml-ieqn-1811"><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula> the corresponding coefficient, for <inline-formula id="ieqn-1812"><mml:math id="mml-ieqn-1812"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>.</p> 
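<p>As an illustration, the finite expansion in Eq. (<xref ref-type="disp-formula" rid="eqn-345">345</xref>) can be evaluated with a few lines of code. The following is a minimal sketch under stated assumptions, not an implementation taken from the cited references: the Gaussian kernel, its length scale, the sample points, and the coefficient values are hypothetical choices made only for this example, as are the function names <monospace>gaussian_kernel</monospace> and <monospace>evaluate_expansion</monospace>.</p>
<preformat>
```python
import math


def gaussian_kernel(x, y, length_scale=1.0):
    """Gaussian (RBF) kernel; an assumed example of a kernel K(x, y)."""
    return math.exp(-((x - y) ** 2) / (2.0 * length_scale ** 2))


def evaluate_expansion(x, centers, alphas, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_i alpha_i * K(x, x_i), the form of Eq. (345)."""
    return sum(a * kernel(x, c) for a, c in zip(alphas, centers))


# Hypothetical data: n = 3 training inputs and illustrative coefficients.
centers = [0.0, 1.0, 2.0]   # x_1, ..., x_n
alphas = [0.5, -1.0, 2.0]   # placeholder values, not fitted coefficients

value_at_center = evaluate_expansion(1.0, centers, alphas)
value_far_away = evaluate_expansion(50.0, centers, alphas)
```
</preformat>
<p>Evaluating the expansion requires one kernel evaluation per training point, regardless of the (possibly infinite) number of eigenfunctions underlying the kernel.</p>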
<statement id="st8_2"><title>Remark 8.2.</title>
<p><italic>Finite-dimensional solution to infinite-dimensional problem</italic>. Since <inline-formula id="ieqn-1813"><mml:math id="mml-ieqn-1813"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is expressed as in Eq. (<xref ref-type="disp-formula" rid="eqn-342">342</xref>)<inline-formula id="ieqn-1814"><mml:math id="mml-ieqn-1814"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-1815"><mml:math id="mml-ieqn-1815"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as in Eq. (<xref ref-type="disp-formula" rid="eqn-343">343</xref>)<inline-formula id="ieqn-1816"><mml:math id="mml-ieqn-1816"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, it follows that</p>
<p><disp-formula id="eqn-346"><label>(346)</label><mml:math id="mml-eqn-346" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>y</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>following the same argument as in Remark <xref ref-type="statement" rid="st8_1">8.1</xref>. As a result,</p>
<p><disp-formula id="eqn-347"><label>(347)</label><mml:math id="mml-eqn-347" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>y</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Thus <inline-formula id="ieqn-1817"><mml:math id="mml-ieqn-1817"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the Gram matrix of the set of functions <inline-formula id="ieqn-1818"><mml:math id="mml-ieqn-1818"><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:mi>&#x1D4A6;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>&#x1D4A6;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>. 
To show that <inline-formula id="ieqn-1819"><mml:math id="mml-ieqn-1819"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&gt;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, i.e., positive definite, for any set <inline-formula id="ieqn-1820"><mml:math id="mml-ieqn-1820"><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>a</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, consider</p>
<p><disp-formula id="eqn-348"><label>(348)</label><mml:math id="mml-eqn-348" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>a</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>a</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>p</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:munderover><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo 
stretchy="false">)</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>p</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:munderover><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-349"><label>(349)</label><mml:math id="mml-eqn-349" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>b</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo>:=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:msub><mml:mi>a</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>p</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is equivalent to the matrix <inline-formula id="ieqn-1821"><mml:math id="mml-ieqn-1821"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> being positive definite, i.e., <inline-formula id="ieqn-1822"><mml:math id="mml-ieqn-1822"><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&gt;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>,<xref ref-type="fn" rid="fn234"><sup>234</sup></xref><fn id="fn234"><label>234</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-240">240</xref>], p. 11.</p></fn> and thus the functions <inline-formula id="ieqn-1823"><mml:math id="mml-ieqn-1823"><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>&#x1D4A6;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> are linearly independent and form a basis, making an expression such as Eq. (<xref ref-type="disp-formula" rid="eqn-345">345</xref>) possible.</p>
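<p>The non-negativity of the quadratic form in Eq. (<xref ref-type="disp-formula" rid="eqn-348">348</xref>) can be checked numerically. The following is a minimal sketch, assuming a Gaussian kernel and arbitrarily chosen sample points (both hypothetical, not from the source): it builds the Gram matrix and evaluates the quadratic form for random coefficient vectors.</p>
<preformat>
```python
import math
import random


def gaussian_kernel(x, y, length_scale=1.0):
    """Gaussian (RBF) kernel; an assumed example of a kernel K(x, y)."""
    return math.exp(-((x - y) ** 2) / (2.0 * length_scale ** 2))


def gram_matrix(points, kernel=gaussian_kernel):
    """Gram matrix with entries K(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]


def quadratic_form(matrix, a):
    """Double sum over i, j of a_i * K_ij * a_j, as on the left of Eq. (348)."""
    n = len(a)
    return sum(a[i] * matrix[i][j] * a[j] for i in range(n) for j in range(n))


points = [0.0, 0.7, 1.5, 2.2, 3.0]   # hypothetical, distinct sample points
K = gram_matrix(points)

rng = random.Random(0)               # fixed seed for reproducibility
samples = [[rng.uniform(-1.0, 1.0) for _ in points] for _ in range(200)]
min_form = min(quadratic_form(K, a) for a in samples)
```
</preformat>
<p>A numerical check of this kind does not replace the argument in Eq. (<xref ref-type="disp-formula" rid="eqn-348">348</xref>); it only illustrates it for one kernel and one set of points.</p>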
<p>The goal now is to show that the solution to the <italic>infinite</italic>-dimensional regularized minimization problem Eq. (<xref ref-type="disp-formula" rid="eqn-344">344</xref>) is <italic>finite</italic> dimensional, for which the coefficients <inline-formula id="ieqn-1826"><mml:math id="mml-ieqn-1826"><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula>, <inline-formula id="ieqn-1827"><mml:math id="mml-ieqn-1827"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, in Eq. (<xref ref-type="disp-formula" rid="eqn-345">345</xref>) are to be determined. It is also remarkable that a solution of the form Eq. (<xref ref-type="disp-formula" rid="eqn-345">345</xref>) holds in general for <italic>any</italic> type of differentiable loss function <inline-formula id="ieqn-1828"><mml:math id="mml-ieqn-1828"><mml:mi>L</mml:mi></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-344">344</xref>), and is not restricted to the squared-error loss [<xref ref-type="bibr" rid="ref-241">241</xref>] [<xref ref-type="bibr" rid="ref-239">239</xref>].</p>
<p>For notational compactness, let the objective function (loss plus penalty) in Eq. (<xref ref-type="disp-formula" rid="eqn-344">344</xref>) be written as</p>
<p><disp-formula id="eqn-350"><label>(350)</label><mml:math id="mml-eqn-350" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mover><mml:mi>L</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo stretchy="false">[</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>:=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2225;</mml:mo><mml:mi>f</mml:mi><mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo 
stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and set the derivative of <inline-formula id="ieqn-1829"><mml:math id="mml-ieqn-1829"><mml:mover accent='true'><mml:mi>L</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> with respect to the coefficients <inline-formula id="ieqn-1830"><mml:math id="mml-ieqn-1830"><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:math></inline-formula>, for <inline-formula id="ieqn-1831"><mml:math id="mml-ieqn-1831"><mml:mi>p</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula>, to zero to solve for these coefficients:</p>
<p><disp-formula id="eqn-351"><label>(351)</label><mml:math id="mml-eqn-351" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mover><mml:mi>L</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:mfrac><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mo>&#x2225;</mml:mo><mml:mi>f</mml:mi><mml:msubsup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03BB;</mml:mi><mml:mfrac><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mfrac><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-352"><label>(352)</label><mml:math id="mml-eqn-352" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msubsup><mml:mi>&#x03B6;</mml:mi><mml:mi>p</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x03BB;</mml:mi></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>k</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>k</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo>:=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x03BB;</mml:mi></mml:mrow></mml:mfrac><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-353"><label>(353)</label><mml:math id="mml-eqn-353" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>&#x03B6;</mml:mi><mml:mi>k</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the last expression in Eq. (<xref ref-type="disp-formula" rid="eqn-353">353</xref>) comes from using the kernel expression in Eq. (<xref ref-type="disp-formula" rid="eqn-343">343</xref>)<inline-formula id="ieqn-1832"><mml:math id="mml-ieqn-1832"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, and the end result is Eq. (<xref ref-type="disp-formula" rid="eqn-345">345</xref>), i.e., the solution (minimizer) <inline-formula id="ieqn-1833"><mml:math id="mml-ieqn-1833"><mml:msup><mml:mi>f</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula> is of finite dimension.<xref ref-type="fn" rid="fn235"><sup>235</sup></xref><fn id="fn235"><label>235</label><p>See also [<xref ref-type="bibr" rid="ref-238">238</xref>], p. 169, Eq. (5.50), and p. 185.</p></fn>&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
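<p>The result of Remark <xref ref-type="statement" rid="st8_2">8.2</xref> can also be illustrated numerically. The sketch below is not taken from the cited references and rests on several assumptions: a kernel whose expansion has only a finite number <italic>m</italic> of terms, the squared-error loss (for which the stationarity condition gives <italic>&#x03B1;<sub>k</sub></italic> = [<italic>y<sub>k</sub></italic> &#x2212; <italic>f</italic>(<italic>x<sub>k</sub></italic>)]/<italic>&#x03BB;</italic>), and arbitrary illustrative choices for the functions <italic>&#x03D5;<sub>p</sub></italic>, the values &#x0394;<sub><italic>p</italic></sub>, and the data. It minimizes the objective in Eq. (<xref ref-type="disp-formula" rid="eqn-350">350</xref>) directly over the coefficients <italic>&#x03B6;<sub>p</sub></italic>, and then compares the resulting function with the kernel expansion of Eq. (<xref ref-type="disp-formula" rid="eqn-345">345</xref>). The helper <monospace>solve_linear</monospace> and all variable names are hypothetical.</p>
<preformat>
```python
import math


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]
    z = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(M[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (M[row][n] - tail) / M[row][row]
    return z


m = 4                                             # assumed finite number of terms
deltas = [1.0 / (p + 1) ** 2 for p in range(m)]   # assumed values Delta_p


def phi(p, x):
    """Assumed functions phi_p (illustrative choice only)."""
    return math.cos((p + 1) * x)


def kernel(x, y):
    """K(x, y) = sum_p Delta_p * phi_p(x) * phi_p(y)."""
    return sum(deltas[p] * phi(p, x) * phi(p, y) for p in range(m))


xs = [0.1, 0.9, 1.7, 2.4, 3.0, 3.8]               # hypothetical inputs
ys = [math.sin(x) for x in xs]                    # hypothetical targets
lam = 0.1                                         # penalty coefficient lambda

# Direct minimization over zeta for the squared-error loss:
# (Phi^T Phi + lam * D^{-1}) zeta = Phi^T y, with D = diag(Delta_p).
A = [[sum(phi(p, x) * phi(q, x) for x in xs)
      + (lam / deltas[p] if p == q else 0.0)
      for q in range(m)] for p in range(m)]
b = [sum(phi(p, x) * y for x, y in zip(xs, ys)) for p in range(m)]
zeta = solve_linear(A, b)


def f_features(x):
    """f(x) = sum_p zeta_p * phi_p(x), the expansion in the functions phi_p."""
    return sum(zeta[p] * phi(p, x) for p in range(m))


# Coefficients of the kernel expansion for the squared-error loss.
alphas = [(y - f_features(x)) / lam for x, y in zip(xs, ys)]


def f_kernel(x):
    """f(x) = sum_i alpha_i * K(x_i, x), the form of Eq. (345)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, xs))


x_test = 1.234
gap = abs(f_features(x_test) - f_kernel(x_test))
```
</preformat>
<p>The quantity <monospace>gap</monospace> is expected to be at the level of floating-point round-off, since both evaluations represent the same function: one through <italic>m</italic> coefficients <italic>&#x03B6;<sub>p</sub></italic>, the other through <italic>n</italic> coefficients <italic>&#x03B1;<sub>i</sub></italic> attached to the data points.</p>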
<p>For the squared-error loss,</p>
<p><disp-formula id="eqn-354"><label>(354)</label><mml:math id="mml-eqn-354" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo
stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-355"><label>(355)</label><mml:math id="mml-eqn-355" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="1em" 
/><mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x22C6;</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>the coefficients <inline-formula id="ieqn-1834"><mml:math id="mml-ieqn-1834"><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>k</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-353">353</xref>) (or Eq. (<xref ref-type="disp-formula" rid="eqn-345">345</xref>)) can be computed using Eq. (<xref ref-type="disp-formula" rid="eqn-352">352</xref>)<inline-formula id="ieqn-1835"><mml:math id="mml-ieqn-1835"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-356"><label>(356)</label><mml:math id="mml-eqn-356" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>k</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>&#x03BB;</mml:mi></mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>k</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>j</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo 
stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>It is then clear from the above that the &#x201C;Setup&#x201D; section in [<xref ref-type="bibr" rid="ref-235">235</xref>] simply corresponded to the particular case where the penalty parameter was zero:</p>
<p><disp-formula id="eqn-357"><label>(357)</label><mml:math id="mml-eqn-357" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>j</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>i.e., <inline-formula id="ieqn-1836"><mml:math id="mml-ieqn-1836"><mml:msup><mml:mi>f</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula> is interpolating.</p>
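<p>The linear solve in Eq. (<xref ref-type="disp-formula" rid="eqn-356">356</xref>) and the interpolating limit in Eq. (<xref ref-type="disp-formula" rid="eqn-357">357</xref>) can be illustrated with a minimal numerical sketch. The Gaussian kernel with unit width, the five sample points, the sine data, the penalty value 0.1, and all function names below are assumptions chosen for illustration only; they are not taken from [<xref ref-type="bibr" rid="ref-235">235</xref>].</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-(x - y)^2 / (2 sigma^2)) for scalar inputs (broadcasts)."""
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))


def fit_coefficients(x_train, y_train, lam, kernel=gaussian_kernel):
    """Solve [K + lam * I] alpha = y for alpha, i.e., the last relation in Eq. (356)."""
    # Gram matrix K_ij = K(x_i, x_j), built by broadcasting a column against a row.
    K = kernel(x_train[:, None], x_train[None, :])
    n = x_train.size
    # A direct solve avoids forming the inverse explicitly.
    return np.linalg.solve(K + lam * np.eye(n), y_train)


def predict(x_new, x_train, alpha, kernel=gaussian_kernel):
    """Evaluate f*(x) = sum_j K(x, x_j) alpha_j at the points x_new."""
    return kernel(np.atleast_1d(x_new)[:, None], x_train[None, :]) @ alpha


# Hypothetical training data (assumption, for illustration only).
x_train = np.linspace(0.0, 4.0, 5)
y_train = np.sin(x_train)

alpha_reg = fit_coefficients(x_train, y_train, lam=0.1)  # penalized fit
alpha_int = fit_coefficients(x_train, y_train, lam=0.0)  # zero penalty, Eq. (357)

# Residuals y_k - f*(x_k) at the training points.
residual_reg = y_train - predict(x_train, x_train, alpha_reg)
residual_int = y_train - predict(x_train, x_train, alpha_int)

print("residual / lambda equals alpha:", np.allclose(residual_reg, 0.1 * alpha_reg))
print("largest residual at zero penalty:", np.abs(residual_int).max())
```
]]></preformat>
<p>Under these assumptions, the residual of the penalized fit at each training point equals the penalty parameter times the corresponding coefficient, as in the first relation of Eq. (<xref ref-type="disp-formula" rid="eqn-356">356</xref>), whereas with zero penalty the residual vanishes up to round-off, i.e., the fitted function interpolates the data.</p>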
<p>For the technical terminology used to describe several concepts presented above, such as Reproducing Kernel Hilbert Space (RKHS), the Riesz Representation Theorem, and <inline-formula id="ieqn-1837"><mml:math id="mml-ieqn-1837"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as a representer of evaluation at <inline-formula id="ieqn-1838"><mml:math id="mml-ieqn-1838"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, see [<xref ref-type="bibr" rid="ref-237">237</xref>] [<xref ref-type="bibr" rid="ref-242">242</xref>] [<xref ref-type="bibr" rid="ref-240">240</xref>] [<xref ref-type="bibr" rid="ref-238">238</xref>].<xref ref-type="fn" rid="fn236"><sup>236</sup></xref><fn id="fn236"><label>236</label><p>A succinct introduction to Hilbert spaces and the Riesz Representation Theorem, with detailed proofs, starting from the basic definitions, can be found in [<xref ref-type="bibr" rid="ref-243">243</xref>].</p></fn></p>
<table-wrap id="table-5"><label>Table 5</label>
<caption>
<p><italic>Some reproducing kernels</italic> (Section <xref ref-type="sec" rid="s8">8</xref>). See [<xref ref-type="bibr" rid="ref-239">239</xref>], [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 296, p. 305, [<xref ref-type="bibr" rid="ref-244">244</xref>].</p></caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th align="left">Regularization network</th>
<th align="left">Kernel function</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Gaussian Radial Basis Function</td>
<td align="left"><inline-formula id="ieqn-419"><mml:math id="mml-ieqn-419"><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">Exponential (Laplacian)</td>
<td align="left"><inline-formula id="ieqn-420"><mml:math id="mml-ieqn-420"><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2225;</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">Inverse multiquadric</td>
<td align="left"><inline-formula id="ieqn-421"><mml:math id="mml-ieqn-421"><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">Multiquadric</td>
<td align="left"><inline-formula id="ieqn-422"><mml:math id="mml-ieqn-422"><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">Thin plate spline (a)</td>
<td align="left"><inline-formula id="ieqn-423"><mml:math id="mml-ieqn-423"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">Thin plate spline (b)</td>
<td align="left"><inline-formula id="ieqn-424"><mml:math id="mml-ieqn-424"><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mi>log</mml:mi><mml:mo>&#x2225;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x2225;</mml:mo></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">Multilayer perceptron (for some values of <inline-formula id="ieqn-425"><mml:math id="mml-ieqn-425"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula>)</td>
<td align="left"><inline-formula id="ieqn-426"><mml:math id="mml-ieqn-426"><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">Polynomial of degree <italic>d</italic></td>
<td align="left"><inline-formula id="ieqn-427"><mml:math id="mml-ieqn-427"><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mi>y</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mi>d</mml:mi></mml:msup></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s8_2"><label>8.2</label>
<title>Exponential functions as reproducing kernels</title>
<p>A list of reproducing kernels, such as those in Table <xref ref-type="table" rid="table-5">5</xref>, is given in, e.g., [<xref ref-type="bibr" rid="ref-241">241</xref>] [<xref ref-type="bibr" rid="ref-239">239</xref>]. Two reproducing kernels based on the exponential function were used in [<xref ref-type="bibr" rid="ref-235">235</xref>] to understand how deep learning works, and are listed in Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>): (1) the popular smooth Gaussian kernel <inline-formula id="ieqn-1841"><mml:math id="mml-ieqn-1841"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mi>G</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>)<inline-formula id="ieqn-1842"><mml:math id="mml-ieqn-1842"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, and (2) the non-smooth Laplacian (exponential) kernel<xref ref-type="fn" rid="fn237"><sup>237</sup></xref><fn id="fn237"><label>237</label><p>In [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 305, the kernel <inline-formula id="ieqn-3291"><mml:math id="mml-ieqn-3291"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mi>L</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>)<inline-formula id="ieqn-3292"><mml:math id="mml-ieqn-3292"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> was called the &#x201C;exponential kernel,&#x201D; without a reference to Laplace, but with a reference to &#x201C;the Ornstein-Uhlenbeck process originally introduced by Uhlenbeck and Ornstein (1930) to describe Brownian motion.&#x201D; Similarly, in [<xref ref-type="bibr" rid="ref-234">234</xref>], p. 85, the term &#x201C;exponential covariance function&#x201D; (or kernel) was used for the kernel <inline-formula id="ieqn-3293"><mml:math id="mml-ieqn-3293"><mml:mi>k</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>r</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-3294"><mml:math id="mml-ieqn-3294"><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula>, in connection with the Ornstein-Uhlenbeck process. Even though the name &#x201C;kernel&#x201D; came from the theory of integral operators [<xref ref-type="bibr" rid="ref-234">234</xref>], p. 80, the attribution of the exponential kernel to Laplace came from the <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Laplace_distribution&amp;oldid=1106362547">Laplace probability distribution</ext-link> (Wikipedia, version 06:55, 24 August 2022), also called the &#x201C;double exponential&#x201D; distribution, and not from the different kernel used in the Laplace transform. See also Remark 8.3.1 and Figure <xref ref-type="fig" rid="fig-86">86</xref>.</p></fn> <inline-formula id="ieqn-1843"><mml:math id="mml-ieqn-1843"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mi>L</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>)<inline-formula id="ieqn-1844"><mml:math id="mml-ieqn-1844"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-358"><label>(358)</label><mml:math id="mml-eqn-358" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mi>G</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:msup><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mi>L</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">&#x2225;</mml:mo></mml:mrow><mml:mi>&#x03C3;</mml:mi></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where &#x03C3; is the standard deviation. The method in Remark <xref ref-type="statement" rid="st8_1">8.1</xref> is not suitable for showing that these exponential functions are reproducing kernels. We provide in Remark <xref ref-type="statement" rid="st8_3">8.3</xref> a verification of the reproducing property of the Laplacian kernel in Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>)<sub>2</sub>.</p>
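<p>The two kernels in Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>) can be evaluated on a small point set as a sanity check. The sketch below builds the Gram matrices of the Gaussian and Laplacian kernels and inspects their eigenvalues. Every Gram matrix of a reproducing kernel is symmetric positive semi-definite, so this is only a necessary-condition check on one sample, not a proof of the reproducing property. The six random points in two dimensions, the random seed, the unit value of &#x03C3;, and all function names are assumptions made for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def pairwise_distances(X, Y):
    """Euclidean distances ||x_i - y_j|| between the rows of X and the rows of Y."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel of Eq. (358)_1: exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-pairwise_distances(X, Y) ** 2 / (2.0 * sigma ** 2))


def laplacian_kernel(X, Y, sigma=1.0):
    """Laplacian kernel of Eq. (358)_2: exp(-||x - y|| / sigma)."""
    return np.exp(-pairwise_distances(X, Y) / sigma)


# Hypothetical sample: six random points in the plane (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))

K_G = gaussian_kernel(X, X)
K_L = laplacian_kernel(X, X)

# Eigenvalues of the symmetric Gram matrices, in ascending order.
eig_G = np.linalg.eigvalsh(K_G)
eig_L = np.linalg.eigvalsh(K_L)

print("smallest eigenvalue, Gaussian :", eig_G[0])
print("smallest eigenvalue, Laplacian:", eig_L[0])
```
]]></preformat>
<p>Both Gram matrices have unit diagonal, since each kernel equals one when its two arguments coincide, and their smallest eigenvalues are expected to be non-negative up to round-off.</p>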
<statement id="st8_3"><title>Remark 8.3.</title>
<p><italic>Laplacian kernel is reproducing</italic>. Consider the Laplacian kernel in Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>)<inline-formula id="ieqn-1847"><mml:math id="mml-ieqn-1847"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> for the scalar case <inline-formula id="ieqn-1848"><mml:math id="mml-ieqn-1848"><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> with <inline-formula id="ieqn-1849"><mml:math id="mml-ieqn-1849"><mml:mi>&#x03C3;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> for simplicity, without loss of generality. The goal is to show that the reproducing property in Eq. (<xref ref-type="disp-formula" rid="eqn-332">332</xref>)<inline-formula id="ieqn-1850"><mml:math id="mml-ieqn-1850"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> holds for this kernel, expressed in Eq. (<xref ref-type="disp-formula" rid="eqn-359">359</xref>)<inline-formula id="ieqn-1851"><mml:math id="mml-ieqn-1851"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> below.</p>
<p><disp-formula id="eqn-359"><label>(359)</label><mml:math id="mml-eqn-359" display="block"><mml:mrow><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>exp</mml:mi><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mo>&#x007C;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x007C;</mml:mo></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x21D2;</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>y</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x2009;for&#x2009;</mml:mtext><mml:mi>y</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mi>x</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x2009;for&#x2009;</mml:mtext><mml:mi>y</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mi>x</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow> 
</mml:mrow></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-360"><label>(360)</label><mml:math id="mml-eqn-360" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2033;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>:=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>y</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>y</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>x</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The method relies on integration by parts and on a function norm different from that in Eq. (<xref ref-type="disp-formula" rid="eqn-332">332</xref>)<inline-formula id="ieqn-1852"><mml:math id="mml-ieqn-1852"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>; see [<xref ref-type="bibr" rid="ref-240">240</xref>], p. 8. Start with the integral in Eq. (<xref ref-type="disp-formula" rid="eqn-332">332</xref>)<inline-formula id="ieqn-1853"><mml:math id="mml-ieqn-1853"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, and integrate by parts:</p>
<p><disp-formula id="eqn-361"><label>(361)</label><mml:math id="mml-eqn-361" display="block"><mml:mrow><mml:mstyle displaystyle='true'><mml:mrow><mml:mo>&#x222B;</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:mrow><mml:mo>&#x222B;</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x222B;</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:mi>f</mml:mi><mml:mi mathvariant='script'>K</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x2032;</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mo>&#x222B;</mml:mo><mml:mi>f</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msubsup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mi>f</mml:mi><mml:mi mathvariant='script'>K</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x221E;</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x221E;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mo>&#x222B;</mml:mo><mml:mi>f</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-362"><label>(362)</label><mml:math id="mml-eqn-362" display="block"><mml:mrow><mml:msubsup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mi>f</mml:mi><mml:mi mathvariant='script'>K</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x221E;</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x221E;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x221E;</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>+</mml:mo></mml:msup></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x221E;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>f</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>+</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>+</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-363"><label>(363)</label><mml:math id="mml-eqn-363" display="block"><mml:mrow><mml:mo>&#x21D2;</mml:mo><mml:mstyle displaystyle='true'><mml:mrow><mml:mo>&#x222B;</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi><mml:mo>+</mml:mo><mml:mo>&#x222B;</mml:mo><mml:mi>f</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi mathvariant='script'>K</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1854"><mml:math id="mml-ieqn-1854"><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-1855"><mml:math id="mml-ieqn-1855"><mml:msup><mml:mi>x</mml:mi><mml:mo>+</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> with <inline-formula id="ieqn-1856"><mml:math id="mml-ieqn-1856"><mml:mi>&#x03F5;</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> being very small. The scalar product on the space of functions that are differentiable almost everywhere, i.e.,</p>
<p><disp-formula id="eqn-364"><label>(364)</label><mml:math id="mml-eqn-364" display="block"><mml:mrow><mml:mo>&#x2329;</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi><mml:mo>&#x232A;</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo stretchy='false'>[</mml:mo><mml:mstyle displaystyle='true'><mml:mrow><mml:mo>&#x222B;</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi><mml:mo>+</mml:mo><mml:mo>&#x222B;</mml:mo><mml:mi>f</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>g</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>d</mml:mi><mml:mi>y</mml:mi><mml:mo stretchy='false'>]</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>together with Eq. (<xref ref-type="disp-formula" rid="eqn-363">363</xref>), and <inline-formula id="ieqn-1857"><mml:math id="mml-ieqn-1857"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, show that the Laplacian kernel is reproducing.<xref ref-type="fn" rid="fn238"><sup>238</sup></xref><fn id="fn238"><label>238</label><p>It is possible to define the Laplacian kernel in Eq. (<xref ref-type="disp-formula" rid="eqn-359">359</xref>)<inline-formula id="ieqn-3295"><mml:math id="mml-ieqn-3295"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> with the factor <inline-formula id="ieqn-3296"><mml:math id="mml-ieqn-3296"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula>, which then will not appear in the definition of the scalar product in Eq. (<xref ref-type="disp-formula" rid="eqn-364">364</xref>). See also [<xref ref-type="bibr" rid="ref-240">240</xref>], p. 8.</p></fn>&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement></sec>
<sec id="s8_3"><label>8.3</label>
<title>Gaussian processes</title>
<p>The Kalman filter, well known in engineering, is an example of a Gaussian-process model. See also Remark <xref ref-type="statement" rid="st9_6">9.6</xref> in Section <xref ref-type="sec" rid="s9_5">9.5</xref> on the 2021 US patent on Physics-Informed Learning Machine that was based on Gaussian processes (GPs),<xref ref-type="fn" rid="fn239"><sup>239</sup></xref><fn id="fn239"><label>239</label><p>Non-Gaussian processes, such as those in [<xref ref-type="bibr" rid="ref-245">245</xref>] [<xref ref-type="bibr" rid="ref-246">246</xref>], are also important, but more advanced, and thus beyond the scope of the present review. See also the Gaussian-Process Summer-School videos (YouTube) <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/playlist?list=PLZ_xn3EIbxZHoq8A3-2F4_rLyy61vkEpU">2019</ext-link> and <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/playlist?list=PLZ_xn3EIbxZGcqHGFj-P_SI6OCXy8TfoL">2021</ext-link>. We thank David Duvenaud for noting that we did not review non-Gaussian processes.</p></fn> which possess the &#x201C;most pleasant resolution imaginable&#x201D; to the question of how to computationally deal with infinite-dimensional objects like functions [<xref ref-type="bibr" rid="ref-234">234</xref>], p. 2:</p>
<disp-quote><p>&#x201C;If you ask only for the properties of the function at a finite number of points, then the inference from the Gaussian process will give you the same answer if you ignore the infinitely many other points, as if you would have taken them into account! And these answers are consistent with any other finite queries you may have. One of the main attractions of the Gaussian process framework is precisely that it unites a sophisticated and consistent view with computational tractability.&#x201D;</p>
</disp-quote><p>A simple example of a Gaussian process is the linear model <inline-formula id="ieqn-1858"><mml:math id="mml-ieqn-1858"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> below [<xref ref-type="bibr" rid="ref-130">130</xref>]:</p>
<p><disp-formula id="eqn-365"><label>(365)</label><mml:math id="mml-eqn-365" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>w</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x230A;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x230B;</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:msub><mml:mi>w</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-366"><label>(366)</label><mml:math id="mml-eqn-366" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with random coefficients <inline-formula id="ieqn-1859"><mml:math id="mml-ieqn-1859"><mml:msub><mml:mi>w</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-1860"><mml:math id="mml-ieqn-1860"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, being normally distributed with zero mean <inline-formula id="ieqn-1861"><mml:math id="mml-ieqn-1861"><mml:mi>&#x03BC;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> and unit variance <inline-formula id="ieqn-1862"><mml:math id="mml-ieqn-1862"><mml:mrow><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></inline-formula>, and with basis functions <inline-formula id="ieqn-1863"><mml:math id="mml-ieqn-1863"><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>n</mml:mi></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>.</p>
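<p>As an illustration of Eqs. (<xref ref-type="disp-formula" rid="eqn-365">365</xref>) and (<xref ref-type="disp-formula" rid="eqn-366">366</xref>), the following minimal Python sketch draws random weights with zero mean and unit variance and evaluates the resulting random polynomials on a grid. NumPy is assumed; the function names <monospace>poly_features</monospace> and <monospace>sample_linear_model</monospace>, the degree, the seed, and the grid are hypothetical choices made only for this example. Each draw of the weights yields one function, so the distribution of the weights induces a distribution over functions.</p>
<code language="python">
```python
import numpy as np


def poly_features(x, n):
    """Return the (n+1) x m array whose column j is phi(x_j) = [1, x_j, ..., x_j**n]."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.vander(x, N=n + 1, increasing=True).T


def sample_linear_model(x, n, num_draws, rng):
    """Draw num_draws functions f(x) = w^T phi(x) with i.i.d. weights w_k ~ N(0, 1)."""
    w = rng.standard_normal((num_draws, n + 1))  # one weight vector per row
    return w @ poly_features(x, n)               # shape (num_draws, m)


rng = np.random.default_rng(0)                   # hypothetical seed
x_grid = np.linspace(-1.0, 1.0, 5)               # hypothetical evaluation grid
samples = sample_linear_model(x_grid, n=3, num_draws=4, rng=rng)
print(samples.shape)  # (4, 5): four sampled functions evaluated at five points
```
</code>
<p>Each row of <monospace>samples</monospace> holds the values of one sampled cubic polynomial at the five grid points.</p>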
<p>More generally, <inline-formula id="ieqn-1864"><mml:math id="mml-ieqn-1864"><mml:mrow><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> could be any basis of nonlinear functions in <inline-formula id="ieqn-1865"><mml:math id="mml-ieqn-1865"><mml:mi>x</mml:mi></mml:math></inline-formula>, and the weights in <inline-formula id="ieqn-1866"><mml:math id="mml-ieqn-1866"><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> could have a joint Gaussian distribution with zero mean and a given covariance matrix <inline-formula id="ieqn-1867"><mml:math id="mml-ieqn-1867"><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow><mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A3;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-1868"><mml:math id="mml-ieqn-1868"><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x03A3;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. If <inline-formula id="ieqn-1869"><mml:math id="mml-ieqn-1869"><mml:msub><mml:mi>w</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-1870"><mml:math id="mml-ieqn-1870"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, are independently and identically distributed (i.i.d.), with variance <inline-formula id="ieqn-1871"><mml:math id="mml-ieqn-1871"><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, then the covariance matrix is diagonal (and is called &#x201C;isotropic&#x201D; [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 84), i.e., <inline-formula id="ieqn-1872"><mml:math id="mml-ieqn-1872"><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A3;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-1873"><mml:math id="mml-ieqn-1873"><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-1874"><mml:math id="mml-ieqn-1874"><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> being the 
identity matrix.</p>
<p>Formally, a <italic>Gaussian process</italic> is a probability distribution over the functions <inline-formula id="ieqn-1875"><mml:math id="mml-ieqn-1875"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> such that, for any arbitrary set of input training points <inline-formula id="ieqn-1876"><mml:math id="mml-ieqn-1876"><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, the set of output values of <inline-formula id="ieqn-1877"><mml:math id="mml-ieqn-1877"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> at these input points, i.e., <inline-formula id="ieqn-1878"><mml:math id="mml-ieqn-1878"><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo 
stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-1879"><mml:math id="mml-ieqn-1879"><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which from Eq. (<xref ref-type="disp-formula" rid="eqn-365">365</xref>) can be written as</p>
<p><disp-formula id="eqn-367"><label>(367)</label><mml:math id="mml-eqn-367" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;such 
that&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A3;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with the <italic>design</italic> matrix <inline-formula id="ieqn-1880"><mml:math id="mml-ieqn-1880"><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, has a joint Gaussian distribution; see [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 305.</p>
<p>Put succinctly, a Gaussian process describes a distribution over functions, and is defined as a collection of random variables (representing the values of the function <inline-formula id="ieqn-1881"><mml:math id="mml-ieqn-1881"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> at location <inline-formula id="ieqn-1882"><mml:math id="mml-ieqn-1882"><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:math></inline-formula>), any finite subset of which has a joint Gaussian distribution [<xref ref-type="bibr" rid="ref-234">234</xref>], p. 13.</p>
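<p>The matrix form and the component form of Eq. (<xref ref-type="disp-formula" rid="eqn-367">367</xref>) can be compared numerically. The sketch below is only an illustration under stated assumptions: NumPy is assumed, the monomial basis (the i-th basis function is the i-th power of x, for i from 1 to n), the helper name <monospace>design_matrix</monospace>, the input points, and the covariance matrix of the weights are all hypothetical choices.</p>
<code language="python">
```python
import numpy as np


def design_matrix(x, n):
    """Return Phi in R^{n x m} with Phi[i-1, j-1] = phi_i(x_j) = x_j**i, i = 1..n."""
    x = np.asarray(x, dtype=float)
    powers = np.arange(1, n + 1)[:, None]    # column of exponents 1..n
    return x[None, :] ** powers              # broadcasts to shape (n, m)


rng = np.random.default_rng(1)               # hypothetical seed
n, m = 3, 4
x = np.linspace(0.5, 2.0, m)                 # m hypothetical input points
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)              # symmetric positive-definite covariance
w = rng.multivariate_normal(np.zeros(n), Sigma)   # one draw w ~ N(0, Sigma)

Phi = design_matrix(x, n)
y_matrix = Phi.T @ w                         # matrix form y = Phi^T w
y_sum = np.array([sum(w[i] * x[j] ** (i + 1) for i in range(n)) for j in range(m)])
print(np.allclose(y_matrix, y_sum))          # True
```
</code>
<p>Repeating the draw of <monospace>w</monospace> produces a new output vector each time; the collection of such vectors is what the joint Gaussian distribution describes.</p>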
<p>The multivariate (joint probability) Gaussian distribution for an <inline-formula id="ieqn-1883"><mml:math id="mml-ieqn-1883"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> matrix <inline-formula id="ieqn-1884"><mml:math id="mml-ieqn-1884"><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:math></inline-formula>, with mean <inline-formula id="ieqn-1885"><mml:math id="mml-ieqn-1885"><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> (element-wise expectation of <inline-formula id="ieqn-1886"><mml:math id="mml-ieqn-1886"><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:math></inline-formula>) and covariance matrix <inline-formula id="ieqn-1887"><mml:math id="mml-ieqn-1887"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> (element-wise expectation of <inline-formula id="ieqn-1888"><mml:math id="mml-ieqn-1888"><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>) is written as</p>
<p><disp-formula id="eqn-368"><label>(368)</label><mml:math id="mml-eqn-368" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:msqrt><mml:mo movablelimits="true" form="prefix">det</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-369"><label>(369)</label><mml:math id="mml-eqn-369" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-370"><label>(370)</label><mml:math id="mml-eqn-370" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow><mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A3;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1889"><mml:math id="mml-ieqn-1889"><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow><mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A3;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is a given covariance matrix of the weight matrix <inline-formula id="ieqn-1890"><mml:math id="mml-ieqn-1890"><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. The covariance matrix <inline-formula id="ieqn-1891"><mml:math id="mml-ieqn-1891"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-370">370</xref>) has the same mathematical structure as Eq. (<xref ref-type="disp-formula" rid="eqn-337">337</xref>), and is therefore a reproducing kernel, with kernel function <inline-formula id="ieqn-1892"><mml:math id="mml-ieqn-1892"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula></p>
<p><disp-formula id="eqn-371"><label>(371)</label><mml:math id="mml-eqn-371" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>cov</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mtext>cov</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo 
stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>p</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
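<p>The identities in Eqs. (<xref ref-type="disp-formula" rid="eqn-370">370</xref>) and (<xref ref-type="disp-formula" rid="eqn-371">371</xref>) can be checked numerically. The following sketch, which assumes NumPy and uses a hypothetical monomial basis, hypothetical input points, and an arbitrarily generated positive-definite weight covariance, compares the matrix product with the entry-wise kernel evaluation and with a Monte Carlo estimate of the second moment of the outputs. The names <monospace>phi</monospace> and <monospace>kernel</monospace> are illustrative only.</p>
<code language="python">
```python
import numpy as np


def phi(x, n=3):
    """Hypothetical basis: phi_p(x) = x**p for p = 1..n, as a length-n vector."""
    return np.array([x ** p for p in range(1, n + 1)])


def kernel(xi, xj, Sigma):
    """K(xi, xj) = phi(xi)^T Sigma phi(xj), i.e., the double sum in Eq. (371)."""
    return phi(xi) @ Sigma @ phi(xj)


rng = np.random.default_rng(2)                       # hypothetical seed
n = 3
x = np.array([-1.0, 0.3, 0.8, 1.5])                  # hypothetical input points
A = rng.standard_normal((n, n))
Sigma = A @ A.T + np.eye(n)                          # weight covariance C_ww

Phi = np.stack([phi(xj) for xj in x], axis=1)        # n x m design matrix
C_yy = Phi.T @ Sigma @ Phi                           # matrix form, Eq. (370)
K = np.array([[kernel(a, b, Sigma) for b in x] for a in x])

# Monte Carlo estimate of E[y y^T] from samples y = Phi^T w, w ~ N(0, Sigma)
W = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
Y = W @ Phi                                          # each row is one sample of y^T
C_mc = Y.T @ Y / len(Y)

print(np.allclose(C_yy, K))                          # True
print(np.max(np.abs(C_mc - C_yy)))                   # sampling error, shrinks with more draws
```
</code>
<p>The first check is exact up to round-off, whereas the Monte Carlo estimate agrees only up to sampling error.</p>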
<p>In the case of an isotropic covariance matrix <inline-formula id="ieqn-1893"><mml:math id="mml-ieqn-1893"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>, the kernel function <inline-formula id="ieqn-1894"><mml:math id="mml-ieqn-1894"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> takes a simple form:<xref ref-type="fn" rid="fn240"><sup>240</sup></xref><fn id="fn240"><label>240</label><p>The &#x201C;precision&#x201D; is the inverse of the variance, i.e., <inline-formula id="ieqn-3297"><mml:math id="mml-ieqn-3297"><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 304.</p></fn></p>
<p><disp-formula id="eqn-372"><label>(372)</label><mml:math id="mml-eqn-372" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mtext>cov</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>p</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo 
stretchy="false">)</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-373"><label>(373)</label><mml:math id="mml-eqn-373" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>with&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
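<p>As an illustration of Eqs. (<xref ref-type="disp-formula" rid="eqn-372">372</xref>)-(<xref ref-type="disp-formula" rid="eqn-373">373</xref>), the following minimal sketch evaluates the kernel as a scaled inner product of basis-function vectors. The monomial basis, the function names <monospace>phi</monospace> and <monospace>kernel_from_basis</monospace>, and the numerical values are assumptions made only for this example; they are not taken from the references cited here.</p>
<code language="python">
```python
import numpy as np

def phi(x, n=5):
    """Hypothetical basis (an assumption): monomials x**0, ..., x**(n-1).

    Returns an array of shape (n, m), one column phi(x_i) per input x_i.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.stack([x**p for p in range(n)], axis=0)

def kernel_from_basis(xi, xj, sigma2=1.0, n=5):
    """Kernel of Eq. (372): sigma^2 * phi(x_i)^T phi(x_j) for C_ww = sigma^2 I."""
    return sigma2 * phi(xi, n).T @ phi(xj, n)

# Gram matrix on four illustrative inputs, with sigma^2 = 0.5 (assumed value)
x = np.linspace(-1.0, 1.0, 4)
K = kernel_from_basis(x, x, sigma2=0.5)
```
</code>
<p>Any Gram matrix built this way is symmetric and positive semidefinite by construction, since it is a scaled product of a matrix with its own transpose.</p>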
<p>The Gaussian (normal) probability distribution <inline-formula id="ieqn-1895"><mml:math id="mml-ieqn-1895"><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-368">368</xref>) is the <italic>prior</italic> probability distribution for <inline-formula id="ieqn-1896"><mml:math id="mml-ieqn-1896"><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:math></inline-formula>, before any conditioning with observed data (i.e., before specifying the actual observed values of the outputs in <inline-formula id="ieqn-1897"><mml:math id="mml-ieqn-1897"><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>).</p>
<statement id="st8_4"><title>Remark 8.4.</title>
<p><italic>Zero mean</italic>. In a Gaussian process, the joint distribution Eq. (<xref ref-type="disp-formula" rid="eqn-368">368</xref>) over the outputs, i.e., the <inline-formula id="ieqn-1898"><mml:math id="mml-ieqn-1898"><mml:mi>n</mml:mi></mml:math></inline-formula> random variables in <inline-formula id="ieqn-1899"><mml:math id="mml-ieqn-1899"><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, is defined completely by &#x201C;second-order statistics,&#x201D; i.e., the mean <inline-formula id="ieqn-1900"><mml:math id="mml-ieqn-1900"><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> and the covariance <inline-formula id="ieqn-1901"><mml:math id="mml-ieqn-1901"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>. In practice, the mean of <inline-formula id="ieqn-1902"><mml:math id="mml-ieqn-1902"><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is not available a priori, and &#x201C;so by symmetry, we take it to be zero&#x201D; [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 
305, or equivalently, specify that the weights in <inline-formula id="ieqn-1903"><mml:math id="mml-ieqn-1903"><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> have zero mean, as in Eqs. (<xref ref-type="disp-formula" rid="eqn-365">365</xref>), (<xref ref-type="disp-formula" rid="eqn-367">367</xref>), (<xref ref-type="disp-formula" rid="eqn-369">369</xref>).&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<fig id="fig-86">
<label>Figure 86</label>
<caption><title><italic>Gaussian process priors</italic> (Section <xref ref-type="sec" rid="s8_3">8.3</xref>). <italic>Left</italic>: Two samples with Gaussian kernel. <italic>Right:</italic> Two samples with Laplacian kernel. Parameters for both kernels: Kernel precision (inverse of variance) <inline-formula id="ieqn-760"><mml:math id="mml-ieqn-760"><mml:mi>&#x03B3;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mn>0.2</mml:mn></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>), isotropic noise variance <inline-formula id="ieqn-761"><mml:math id="mml-ieqn-761"><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>6</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:math></inline-formula> added to covariance matrix <inline-formula id="ieqn-762"><mml:math id="mml-ieqn-762"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>y</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> of output <inline-formula id="ieqn-763"><mml:math id="mml-ieqn-763"><mml:mi>y</mml:mi></mml:math></inline-formula> and isotropic weight covariance matrix <inline-formula id="ieqn-764"><mml:math id="mml-ieqn-764"><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03A3;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-374">374</xref>).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-86.tif"/>
</fig>
<sec id="s8_3_1"><label>8.3.1</label>
<title>Gaussian-process priors and sampling</title>
<p>Instead of defining the kernel function <inline-formula id="ieqn-1904"><mml:math id="mml-ieqn-1904"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> by selecting a basis of functions as in Eq. (<xref ref-type="disp-formula" rid="eqn-371">371</xref>), the kernel function can be directly defined using the analytical expressions in Table <xref ref-type="table" rid="table-5">5</xref> and Section <xref ref-type="sec" rid="s8_2">8.2</xref>. Figure <xref ref-type="fig" rid="fig-86">86</xref> provides a comparison of samples from two Gaussian-process priors, one with Gaussian kernel <inline-formula id="ieqn-1905"><mml:math id="mml-ieqn-1905"><mml:msub><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mi>G</mml:mi></mml:msub></mml:math></inline-formula> and one with Laplacian kernel <inline-formula id="ieqn-1906"><mml:math id="mml-ieqn-1906"><mml:msub><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mi>L</mml:mi></mml:msub></mml:math></inline-formula>, as given in Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>) with precision <inline-formula id="ieqn-1907"><mml:math id="mml-ieqn-1907"><mml:mi>&#x03B3;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. 
Since the Gram matrix <inline-formula id="ieqn-1908"><mml:math id="mml-ieqn-1908"><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> for a kernel <inline-formula id="ieqn-1909"><mml:math id="mml-ieqn-1909"><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow></mml:math></inline-formula> is in general only positive semidefinite<xref ref-type="fn" rid="fn241"><sup>241</sup></xref><fn id="fn241"><label>241</label><p>That the Gram matrix is positive semidefinite should be familiar to practitioners of the finite element method, in which the Gram matrix is the stiffness matrix (for elliptic differential operators), and is positive semidefinite before applying the essential boundary conditions.</p></fn> ([<xref ref-type="bibr" rid="ref-130">130</xref>], p. 295), to make the Gram matrix <inline-formula id="ieqn-1910"><mml:math id="mml-ieqn-1910"><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> in Eq. 
(<xref ref-type="disp-formula" rid="eqn-367">367</xref>) positive definite before performing a Cholesky decomposition, a noise with isotropic variance <inline-formula id="ieqn-1911"><mml:math id="mml-ieqn-1911"><mml:mrow><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>I</mml:mi></mml:mrow></mml:math></inline-formula> is added to <inline-formula id="ieqn-1912"><mml:math id="mml-ieqn-1912"><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> ([<xref ref-type="bibr" rid="ref-130">130</xref>], p. 314), i.e., the noise is independent and has the same variance in each direction. Then the sample function values can be obtained as follows ([<xref ref-type="bibr" rid="ref-130">130</xref>], p. 528):</p>
<p><disp-formula id="eqn-374"><label>(374)</label><mml:math id="mml-eqn-374" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BD;</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03BD;</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A3;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow><mml:msub><mml:mrow></mml:mrow><mml:mi>&#x03BD;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x03A3;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;then&#xA0;</mml:mtext></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03BD;</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>In other words, the GP prior samples in Figure <xref ref-type="fig" rid="fig-86">86</xref> were drawn from the Gaussian distribution with zero mean and covariance matrix <inline-formula id="ieqn-1913"><mml:math id="mml-ieqn-1913"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BD;</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-374">374</xref>):</p>
<p><disp-formula id="eqn-375"><label>(375)</label><mml:math id="mml-eqn-375" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>:=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BD;</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-1914"><mml:math id="mml-ieqn-1914"><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> contains the test inputs, and <inline-formula id="ieqn-1915"><mml:math id="mml-ieqn-1915"><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula> contains the predictive values of <inline-formula id="ieqn-1916"><mml:math id="mml-ieqn-1916"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> at <inline-formula id="ieqn-1917"><mml:math id="mml-ieqn-1917"><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula>.</p>
<p>It can be observed from Figure <xref ref-type="fig" rid="fig-86">86</xref> that samples obtained with the Gaussian kernel are smooth, with slow variations, whereas samples obtained with the Laplacian kernel are rough, with rapid variations, and are thus appropriate for modeling Brownian motion; see Footnote <xref ref-type="fn" rid="fn237">237</xref>.</p>
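<p>The sampling procedure in Eqs. (<xref ref-type="disp-formula" rid="eqn-374">374</xref>)-(<xref ref-type="disp-formula" rid="eqn-375">375</xref>) can be sketched as follows. This is a minimal illustration and not the code used to produce Figure <xref ref-type="fig" rid="fig-86">86</xref>. The kernel expressions, written here as the exponential of minus the precision times the squared (Gaussian) or absolute (Laplacian) distance, are assumed forms that may differ from Eq. (<xref ref-type="disp-formula" rid="eqn-358">358</xref>) by a constant factor in the exponent. The function names, the input grid, and the random seed are likewise assumptions; the precision 0.2 and the noise variance 10<sup>&#x2212;6</sup> follow the caption of Figure <xref ref-type="fig" rid="fig-86">86</xref>.</p>
<code language="python">
```python
import numpy as np

def gaussian_kernel(xa, xb, gamma=0.2):
    """Assumed Gaussian kernel form: exp(-gamma * (x - x')**2)."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-gamma * d**2)

def laplacian_kernel(xa, xb, gamma=0.2):
    """Assumed Laplacian kernel form: exp(-gamma * |x - x'|)."""
    d = np.abs(xa[:, None] - xb[None, :])
    return np.exp(-gamma * d)

def sample_gp_prior(x, kernel, nu2=1e-6, n_samples=2, seed=0):
    """Draw zero-mean GP prior samples y = L w, following Eq. (374)."""
    rng = np.random.default_rng(seed)
    K = kernel(x, x)                                  # Gram matrix
    L = np.linalg.cholesky(K + nu2 * np.eye(len(x)))  # K + nu^2 I = L L^T
    w = rng.standard_normal((len(x), n_samples))      # w ~ N(0, I)
    return L @ w                                      # one sample per column

x = np.linspace(-5.0, 5.0, 50)       # illustrative input grid (assumption)
y_gauss = sample_gp_prior(x, gaussian_kernel)
y_lap = sample_gp_prior(x, laplacian_kernel)
```
</code>
<p>Each column of the returned array is one sample function evaluated on the grid. Without the added noise term, the Cholesky factorization may fail for the Gaussian kernel, whose Gram matrix is numerically close to singular on a dense grid.</p>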
</sec>
<sec id="s8_3_2"><label>8.3.2</label>
<title>Gaussian-process posteriors and sampling</title>
<p>Let <inline-formula id="ieqn-1918"><mml:math id="mml-ieqn-1918"><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> be the observed data (or target values) at the training points <inline-formula id="ieqn-1919"><mml:math id="mml-ieqn-1919"><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, and let <inline-formula id="ieqn-1920"><mml:math id="mml-ieqn-1920"><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi>m</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> contain the <inline-formula id="ieqn-1921"><mml:math id="mml-ieqn-1921"><mml:mover 
accent='true'><mml:mi>m</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> test input points, with the predictive values in <inline-formula id="ieqn-1922"><mml:math id="mml-ieqn-1922"><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mover accent='true'><mml:mi>m</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. The combined function values in the matrix <inline-formula id="ieqn-1923"><mml:math id="mml-ieqn-1923"><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mover accent='true'><mml:mi>m</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, as random variables, are distributed &#x201C;normally&#x201D; (Gaussian), i.e.,</p>
<fig id="fig-87">
<label>Figure 87</label>
<caption><title><italic>Gaussian process prior and posterior samplings, Gaussian kernel</italic> (Section <xref ref-type="sec" rid="s8_3">8.3</xref>). <italic>Top left:</italic> Gaussian-prior samples (Section <xref ref-type="sec" rid="s8_3_1">8.3.1</xref>). The shaded red zones represent the predictive density at each input location. <italic>Top right:</italic> Gaussian-posterior samples with 1 data point. <italic>Bottom left:</italic> Gaussian-posterior samples with 2 data points. <italic>Bottom right:</italic> Gaussian-posterior samples with 3 data points [<xref ref-type="bibr" rid="ref-247">247</xref>]. See Figure <xref ref-type="fig" rid="fig-88">88</xref> for the noise effects and Figure <xref ref-type="fig" rid="fig-89">89</xref> for animation of GP priors and posteriors. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-87.tif"/>
</fig>
<p><disp-formula id="eqn-376"><label>(376)</label><mml:math id="mml-eqn-376" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>{</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo>}</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>}</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
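<p>The block structure of the joint covariance matrix in Eq. (<xref ref-type="disp-formula" rid="eqn-376">376</xref>) can be assembled as in the following minimal sketch. The Gaussian kernel form, the function names, and the training and test inputs are assumptions chosen only for illustration.</p>
<code language="python">
```python
import numpy as np

def gaussian_kernel(xa, xb, gamma=0.2):
    """Assumed Gaussian kernel form: exp(-gamma * (x - x')**2)."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-gamma * d**2)

def joint_covariance(x, x_test, kernel, nu2=1e-6):
    """Covariance of the stacked vector [y, y_test] as in Eq. (376)."""
    K_xx = kernel(x, x) + nu2 * np.eye(len(x))   # K(x, x) + nu^2 I
    K_xt = kernel(x, x_test)                     # K(x, x_test)
    K_tt = kernel(x_test, x_test)                # K(x_test, x_test)
    return np.block([[K_xx, K_xt], [K_xt.T, K_tt]])

x = np.array([-2.5, 0.3, 1.5])           # m = 3 training inputs (assumed)
x_test = np.linspace(-4.0, 4.0, 5)       # 5 test inputs (assumed)
C = joint_covariance(x, x_test, gaussian_kernel)
```
</code>
<p>Here the noise variance enters only the upper-left block, as in Eq. (<xref ref-type="disp-formula" rid="eqn-376">376</xref>), and the lower-left block is the transpose of the upper-right block, so that the assembled matrix is symmetric.</p>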
<p>The Gaussian-process posterior distribution, i.e., the conditional Gaussian distribution for the test output <inline-formula id="ieqn-1924"><mml:math id="mml-ieqn-1924"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> given the training data <inline-formula id="ieqn-1925"><mml:math id="mml-ieqn-1925"><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, is then (see Appendix <xref ref-type="sec" rid="s17">3</xref> for the detailed derivation, which is simpler than in [<xref ref-type="bibr" rid="ref-130">130</xref>] and in [<xref ref-type="bibr" rid="ref-248">248</xref>])<xref ref-type="fn" rid="fn242"><sup>242</sup></xref><fn id="fn242"><label>242</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-234">234</xref>], p. 16, [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 87, [<xref ref-type="bibr" rid="ref-247">247</xref>], p. 4. The authors of [<xref ref-type="bibr" rid="ref-234">234</xref>], in their Appendix A.2, p. 200, referred to [<xref ref-type="bibr" rid="ref-248">248</xref>] &#x201C;sec. 9.3&#x201D; for the derivation of Eqs. (<xref ref-type="disp-formula" rid="eqn-378">378</xref>)-(<xref ref-type="disp-formula" rid="eqn-379">379</xref>), but there were several sections numbered &#x201C;9.3&#x201D; in [<xref ref-type="bibr" rid="ref-248">248</xref>]; the correct referencing should be [<xref ref-type="bibr" rid="ref-248">248</xref>], Chapter XIII &#x201C;More on distributions,&#x201D; Sec. 9.3 &#x201C;Marginal distributions and conditional distributions,&#x201D; p. 427.</p></fn></p>
<p><disp-formula id="eqn-377"><label>(377)</label><mml:math id="mml-eqn-377" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-378"><label>(378)</label><mml:math id="mml-eqn-378" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-379"><label>(379)</label><mml:math id="mml-eqn-379" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo 
stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the mean was set to zero by Remark <xref ref-type="statement" rid="st8_4">8.4</xref>. In Figure <xref ref-type="fig" rid="fig-87">87</xref>, the number <inline-formula id="ieqn-1926"><mml:math id="mml-ieqn-1926"><mml:mi>m</mml:mi></mml:math></inline-formula> of training points varied from 1 to 3. The Gaussian posterior sampling follows the same method as in Eq. (<xref ref-type="disp-formula" rid="eqn-374">374</xref>), but with the covariance matrix in Eq. (<xref ref-type="disp-formula" rid="eqn-379">379</xref>):<xref ref-type="fn" rid="fn243"><sup>243</sup></xref><fn id="fn243"><label>243</label><p>The Matlab <ext-link ext-link-type="uri" xlink:href="https://github.com/duvenaud/phd-thesis/blob/62ff4d61f27e7f83bc968324cc1f864a1ea7344c/code/plot_oned_gp.m">code</ext-link> for generating Figures <xref ref-type="fig" rid="fig-87">87</xref> and <xref ref-type="fig" rid="fig-88">88</xref> was provided courtesy of David Duvenaud. On line 25 of the code, the noise variance <inline-formula id="ieqn-3298"><mml:math id="mml-ieqn-3298"><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula> was set as <monospace>sigma = 0.02</monospace>, which was much larger than <inline-formula id="ieqn-3299"><mml:math id="mml-ieqn-3299"><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>6</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> used in Figure <xref ref-type="fig" rid="fig-86">86</xref>.</p></fn></p>
<p><disp-formula id="eqn-380"><label>(380)</label><mml:math id="mml-eqn-380" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A3;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;then&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
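<p>As a rough illustration of Eqs. (<xref ref-type="disp-formula" rid="eqn-378">378</xref>), (<xref ref-type="disp-formula" rid="eqn-379">379</xref>) and (<xref ref-type="disp-formula" rid="eqn-380">380</xref>), the following plain-Python sketch computes the posterior mean and covariance for a zero prior mean and draws one posterior sample through a Cholesky factor. The squared-exponential kernel with unit length scale, the function names (<monospace>sq_exp_kernel</monospace>, <monospace>gp_posterior</monospace>, <monospace>sample_posterior</monospace>), the training data and the small diagonal jitter are assumptions made here for illustration; they are not taken from the Matlab code cited above. The sketch also adds the posterior mean to the product of the Cholesky factor and <monospace>w</monospace>, so that the sample is centered on that mean.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
import random


def sq_exp_kernel(a, b, length=1.0):
    """Kernel matrix K(a, b) of the squared-exponential kernel (assumed here)."""
    return [[math.exp(-0.5 * ((p - q) / length) ** 2) for q in b] for p in a]


def cholesky(A):
    """Lower-triangular factor L of a symmetric positive-definite A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) z = b by forward and backward substitution."""
    n = len(L)
    u = [0.0] * n
    for i in range(n):
        u[i] = (b[i] - sum(L[i][k] * u[k] for k in range(i))) / L[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (u[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z


def gp_posterior(x, y, x_new, noise_var=1e-6):
    """Posterior mean, Eq. (378), and covariance, Eq. (379), for zero prior mean."""
    A = sq_exp_kernel(x, x)
    for i in range(len(x)):
        A[i][i] += noise_var                      # K(x, x) + nu^2 I
    L = cholesky(A)
    K_nx = sq_exp_kernel(x_new, x)                # K(x~, x)
    K_nn = sq_exp_kernel(x_new, x_new)            # K(x~, x~)
    alpha = chol_solve(L, y)                      # [K + nu^2 I]^{-1} y
    mean = [sum(k * a for k, a in zip(row, alpha)) for row in K_nx]
    V = [chol_solve(L, row) for row in K_nx]      # V[j] = [K + nu^2 I]^{-1} K(x, x~_j)
    n_new = len(x_new)
    cov = [[K_nn[i][j] - sum(k * v for k, v in zip(K_nx[i], V[j]))
            for j in range(n_new)] for i in range(n_new)]
    return mean, cov


def sample_posterior(mean, cov, rng, jitter=1e-9):
    """Draw mu~ + L~ w with w ~ N(0, I) and C = L~ L~^T, cf. Eq. (380)."""
    n = len(mean)
    C = [[cov[i][j] + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(C)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [mean[i] + sum(L[i][k] * w[k] for k in range(i + 1)) for i in range(n)]


# Three training points (m = 3); predict at a training location and far away.
x_train, y_train = [-1.0, 0.0, 1.0], [0.5, -0.3, 0.8]
mean, cov = gp_posterior(x_train, y_train, [0.0, 10.0])
draw = sample_posterior(mean, cov, random.Random(0))
```
]]></preformat>
<p>With a small <monospace>noise_var</monospace>, the posterior mean at a training location nearly reproduces the datum and the posterior variance there is of the order of the noise variance, whereas far from the data the prior (zero mean, unit variance) is recovered. Increasing <monospace>noise_var</monospace> loosens the fit, which is the effect shown in Figure <xref ref-type="fig" rid="fig-88">88</xref>.</p>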
<fig id="fig-88">
<label>Figure 88</label>
<caption><title><italic>Gaussian process posterior samplings, noise effects</italic> (Section <xref ref-type="sec" rid="s8_3">8.3</xref>). Not all sampled curves in Figure <xref ref-type="fig" rid="fig-87">87</xref> went through the data points, such as the black line in the present zoomed-in view of the bottom-left subfigure [<xref ref-type="bibr" rid="ref-247">247</xref>]. It is easy to make the sampled curves pass closer to the data points simply by reducing the noise variance <inline-formula id="ieqn-765"><mml:math id="mml-ieqn-765"><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula> in Eqs. (<xref ref-type="disp-formula" rid="eqn-376">376</xref>), (<xref ref-type="disp-formula" rid="eqn-378">378</xref>), (<xref ref-type="disp-formula" rid="eqn-379">379</xref>). See Figure <xref ref-type="fig" rid="fig-89">89</xref> for an animation of GP priors and posteriors. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-88.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec id="s9"><label>9</label>
<title>Deep-learning libraries, frameworks, platforms</title>
<p>Among the factors that drove the resurgence of AI (see Section <xref ref-type="sec" rid="s2">2</xref>), the availability of effective computing hardware and of libraries that make this hardware accessible for DL-purposes has played, and continues to play, an important role in both the expansion of research efforts and the dissemination in applications. Commercial software and, above all, open-source libraries backed by major players in the software industry and academia have emerged over the last decade; see, e.g., Wikipedia&#x2019;s &#x201C;Comparison of deep learning software&#x201D; <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Comparison_of_deep_learning_software&amp;oldid=1105085605">version 12:51, 18 August 2022</ext-link>.</p>
<p>Figure <xref ref-type="fig" rid="fig-90">90</xref> compares the popularity (as of 2018) of various software frameworks of the DL-realm by means of their &#x201C;Power Scores&#x201D;.<xref ref-type="fn" rid="fn244"><sup>244</sup></xref><fn id="fn244"><label>244</label><p>See [<xref ref-type="bibr" rid="ref-249">249</xref>].</p></fn> The &#x201C;Power Score&#x201D; metric was computed from the occurrences of the respective libraries on 11 different websites, which range from scientific ones such as <ext-link ext-link-type="uri" xlink:href="https://arxiv.org">arXiv</ext-link>, the world&#x2019;s largest repository of research articles and preprints, to social-media outlets such as <ext-link ext-link-type="uri" xlink:href="https://linkedin.com">LinkedIn</ext-link>.</p>
<p>The impressive pace at which DL research and applications progress is also reflected in the changes that the software landscape has undergone. As of 2018, TensorFlow was clearly dominant, see Figure <xref ref-type="fig" rid="fig-90">90</xref>, whereas, according to the author of the 2018 study, Theano was the only one of these libraries that already existed five years earlier.</p>
<p>As of August 2022, the picture has once again changed quite a bit. Using Google Trends as a metric,<xref ref-type="fn" rid="fn245"><sup>245</sup></xref><fn id="fn245"><label>245</label><p>See <ext-link ext-link-type="uri" xlink:href="https://bit.ly/3KPgWFx">this link</ext-link> for the latest Google Trends results corresponding to Figure <xref ref-type="fig" rid="fig-91">91</xref>.</p></fn> PyTorch, which was in third place in 2018, has taken the leading position from TensorFlow as the most popular DL-related software framework, see Figure <xref ref-type="fig" rid="fig-91">91</xref>.</p>
<fig id="fig-89">
<label>Figure 89</label>
<caption><title><italic>Gaussian process posterior samplings, animation</italic> (Section <xref ref-type="sec" rid="s8_3">8.3</xref>). <ext-link ext-link-type="uri" xlink:href="http://www.infinitecuriosity.org/vizgp/">Interactive Gaussian Process Visualization</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://www.infinitecuriosity.org/">Infinite curiosity</ext-link>. Click on the plot area to specify data points. See Figures <xref ref-type="fig" rid="fig-87">87</xref> and <xref ref-type="fig" rid="fig-88">88</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-89.tif"/>
</fig>
<p>Although the individual software frameworks differ in functionality, scope and internals, their overall purpose is clearly the same, i.e., to facilitate the creation and training of neural networks and to harness the computational power of parallel computing hardware such as GPUs. For this reason, the libraries share the following ingredients, which are essentially similar but come in different styles:</p>
<list list-type="bullet">
<list-item><p><italic>Linear algebra:</italic> In essence, DL boils down to algebraic operations on large sets of data arranged in multi-dimensional arrays, which are supported by all software frameworks, see Section <xref ref-type="sec" rid="s4_4">4.4</xref>.<xref ref-type="fn" rid="fn246"><sup>246</sup></xref><fn id="fn246"><label>246</label><p>In the context of DL, multi-dimensional arrays are also referred to as tensors, although the data often lack the defining properties of tensors as algebraic objects. The software framework <italic>TensorFlow</italic> even reflects that in its name.</p></fn></p></list-item>
<list-item><p><italic>Back-propagation:</italic> Gradient-based optimization relies on efficient evaluation of derivatives of loss functions with respect to network parameters. The representation of algebraic operations as computational graphs allows for automatic differentiation, which is typically performed in <italic>reverse-mode</italic>, hence, back-propagation, see Section <xref ref-type="sec" rid="s5">5</xref>.</p></list-item>
<list-item><p><italic>Optimization:</italic> DL-libraries provide a variety of optimization algorithms that have proven effective in training of neural networks, see Section <xref ref-type="sec" rid="s6">6</xref>.</p></list-item>
<list-item><p><italic>Hardware-acceleration:</italic> Training deep neural networks is computationally intensive and requires adequate hardware, which allows algebraic computations to be performed in parallel. DL-software frameworks support various kinds of parallel/distributed hardware, ranging from multi-threaded CPUs to GPUs to DL-specific hardware such as TPUs. Parallelism is not restricted to algebraic computations: data also have to be efficiently loaded from storage and transferred to the computing units.</p></list-item>
<list-item><p><italic>Frontend and API:</italic> Popular DL-frameworks provide an intuitive API, which supports accessibility for first-time learners and dissemination of novel methods to scientific fields beyond computer science. <italic>Python</italic> has become the prevailing programming language in DL, since it is more approachable for less-proficient developers than languages traditionally popular in computational science such as C++ or Fortran. High-level APIs provide all essential building blocks (layers, activations, loss functions, optimizers, etc.) for both construction and training of complex network topologies and fully abstract the underlying algebraic operations from users.</p></list-item></list>
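<p>To make the interplay of the ingredients listed above more tangible, the following plain-Python sketch records operations in a computational graph as they are executed, differentiates the loss in reverse mode, and uses the gradients in a few steps of gradient descent for a one-neuron model. The class <monospace>Var</monospace> and the functions <monospace>tanh</monospace>, <monospace>backward</monospace> and <monospace>loss_fn</monospace> are hypothetical names introduced here; the sketch works on scalars only and does not reproduce the API or internals of any of the libraries discussed in this section.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


class Var:
    """Scalar node of a computational graph that is built while computing."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # pairs (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__


def tanh(x):
    """Activation function with its local derivative 1 - tanh^2."""
    t = math.tanh(x.value)
    return Var(t, ((x, 1.0 - t * t),))


def backward(output):
    """Reverse-mode differentiation: propagate d(output)/d(node) to all nodes."""
    order, seen = [], set()

    def visit(node):                # depth-first search, post-order
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):    # every node is processed before its parents
        for parent, local in node.parents:
            parent.grad += local * node.grad


def loss_fn(w, b, x, y):
    """Squared error of a one-neuron model tanh(w x + b) against the target y."""
    r = tanh(w * x + b) + (-y)
    return r * r


# Gradient of the loss at the initial parameters.
w0, b0 = Var(0.5), Var(0.1)
backward(loss_fn(w0, b0, 2.0, 0.3))

# Plain gradient descent with a fixed learning rate of 0.1.
w, b = Var(0.5), Var(0.1)
for _ in range(200):
    loss = loss_fn(w, b, 2.0, 0.3)
    backward(loss)
    w, b = Var(w.value - 0.1 * w.grad), Var(b.value - 0.1 * b.grad)
final_loss = loss_fn(w, b, 2.0, 0.3).value
```
]]></preformat>
<p>In an actual DL-library, the scalar nodes are replaced by multi-dimensional arrays, and the same bookkeeping is carried out on parallel hardware.</p>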
<fig id="fig-90">
<label>Figure 90</label>
<caption><title>Top deep-learning libraries in 2018 by the &#x201C;Power Score&#x201D; in [<xref ref-type="bibr" rid="ref-249">249</xref>]. By 2022, the popularity of the frameworks, as measured by Google Trends, had changed significantly; see Figure <xref ref-type="fig" rid="fig-91">91</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-90.tif"/>
</fig>
<p>In what follows, a brief description of some of the most popular software frameworks is given.</p>
<sec id="s9_1"><label>9.1</label>
<title>TensorFlow</title>
<p><italic>TensorFlow</italic> [<xref ref-type="bibr" rid="ref-250">250</xref>] is a free and open-source software library developed by the <italic>Google Brain</italic> research team, which, in turn, is part of Google&#x2019;s AI division. TensorFlow emerged from Google&#x2019;s proprietary predecessor &#x201C;DistBelief&#x201D; and was released to the public in November 2015. After its release, TensorFlow rapidly became the most popular software framework in the field of deep-learning, and it maintains a leading position as of 2022, although it has been overtaken in popularity by its main competitor PyTorch, particularly in research.</p>
<p>In 2016, Google presented its own AI accelerator hardware for TensorFlow called &#x201C;Tensor Processing Unit&#x201D; (TPU), which is built around an application-specific integrated circuit (ASIC) tailored to computations needed in training and evaluation of neural networks. DeepMind&#x2019;s grandmaster-beating software AlphaGo, see Section <xref ref-type="sec" rid="s2_3">2.3</xref> and Figure <xref ref-type="fig" rid="fig-2">2</xref>, was trained using TPUs.<xref ref-type="fn" rid="fn247"><sup>247</sup></xref><fn id="fn247"><label>247</label><p>See Google&#x2019;s announcement of TPUs [<xref ref-type="bibr" rid="ref-251">251</xref>].</p></fn> TPUs were made available to the public as part of &#x201C;Google Cloud&#x201D; in 2018. A single fourth-generation TPU device has a peak computing power of 275 teraflops for 16 bit floating point numbers (<monospace>bfloat16</monospace>) and 8 bit integers (<monospace>int8</monospace>). A fourth-generation cloud TPU &#x201C;pod&#x201D;, which comprises 4096 TPUs, offers a peak computing power of 1.1 exaflops.<xref ref-type="fn" rid="fn248"><sup>248</sup></xref><fn id="fn248"><label>248</label><p>Recall that the &#x2018;exa&#x2019;-prefix translates into a factor of 1 &#x00D7; 10<sup>18</sup>. For comparison, Nvidia&#x2019;s latest H100 GPU-based accelerator has a half-precision floating point (<monospace>bfloat16</monospace>) performance of 1 petaflops.</p></fn></p>
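<p>The quoted pod figure can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above and assumes that the peak rates of the individual devices simply add up, i.e., any communication overhead between devices is ignored; the variable names are chosen here for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
TERA, PETA, EXA = 10**12, 10**15, 10**18

tpu_v4_peak = 275 * TERA            # peak rate of one fourth-generation TPU device
pod_size = 4096                     # number of TPU devices in one pod
pod_peak = pod_size * tpu_v4_peak   # assumes peak rates add up linearly

pod_exaflops = pod_peak / EXA       # pod peak in exaflops, about 1.1
pod_in_petaflops = pod_peak / PETA  # pod peak in units of 1 petaflops (the H100 figure)
```
]]></preformat>
<p>The product 4096 &#x00D7; 275 teraflops equals 1.1264 exaflops, which agrees with the rounded value of 1.1 exaflops given above.</p>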
<fig id="fig-91">
<label>Figure 91</label>
<caption><title><italic>Google Trends of deep-learning software libraries</italic> (Section <xref ref-type="sec" rid="s9">9</xref>). The chart shows the popularity, over the last 5 years (as of July 2022), of the five DL-related software libraries that were most &#x201C;powerful&#x201D; in 2018. See also Figure <xref ref-type="fig" rid="fig-90">90</xref> for the rankings of DL frameworks in 2018.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-91.tif"/>
</fig>
<sec id="s9_1_1"><label>9.1.1</label>
<title>Keras</title>
<p><italic>Keras</italic> [<xref ref-type="bibr" rid="ref-252">252</xref>] plays a special role among the software frameworks discussed here. As a matter of fact, it is not a full-featured DL-library; rather, Keras can be considered an interface to other libraries that provides a high-level API, which was originally built for various backends including TensorFlow, Theano and the (now deprecated) Microsoft Cognitive Toolkit (CNTK). As of version 2.4, TensorFlow is the only supported backend. Keras, which is also free and open-source, is meant to further simplify experimentation with neural networks as compared to TensorFlow&#x2019;s lower-level API.</p></sec></sec>
<sec id="s9_2"><label>9.2</label>
<title>PyTorch</title>
<p>PyTorch [<xref ref-type="bibr" rid="ref-253">253</xref>] is a free and open-source library, which was originally released to the public in January 2017.<xref ref-type="fn" rid="fn249"><sup>249</sup></xref><fn id="fn249"><label>249</label><p>See this blog post on the history of PyTorch [<xref ref-type="bibr" rid="ref-254">254</xref>] and the YouTube talk of Yann LeCun, PyTorch co-creator Soumith Chintala, Meta&#x2019;s PyTorch lead Lin Qiao and Meta&#x2019;s CTO Mike Schroepfer [<xref ref-type="bibr" rid="ref-255">255</xref>].</p></fn> PyTorch has evolved from a research-oriented DL-framework into a fully fledged environment for both scientific work and industrial applications, which, as of 2022, has caught up with, if not surpassed, TensorFlow in popularity. Primarily addressing researchers in its early days, PyTorch saw rapid growth, not least owing to its then-unique feature of <italic>dynamic</italic> computational graphs, which allow for great flexibility and simplify the creation of complex network architectures. In contrast to competitors such as TensorFlow, PyTorch creates computational graphs, which represent compositions of mathematical operations and allow for automatic differentiation of complex expressions, on the fly, i.e., at the very same time as operations are performed. <italic>Static</italic> graphs, on the other hand, need to be created in a first step, before they can be evaluated and automatically differentiated. Some examples of applying PyTorch to computational mechanics are provided in the next two remarks.</p> 
<statement id="st9_1"><title>Remark 9.1.</title>
<p><italic>Reinforcement Learning (RL)</italic> is a branch of machine-learning, in which computational methods and DL-methods naturally come together. Owing to the progress in DL, reinforcement learning, which has its roots in the early days of cybernetics and machine learning, see, e.g., the survey [<xref ref-type="bibr" rid="ref-256">256</xref>], has gained attention in the fields of automatic control and robotics again. In their opening to a more recent review, the authors of [<xref ref-type="bibr" rid="ref-257">257</xref>] expect no less than <italic>&#x201C;deep reinforcement-learning is poised to revolutionize artificial intelligence and represents a step toward building autonomous systems with a higher level understanding of the visual world.&#x201D;</italic> RL is based on the concept that an autonomous agent learns complex tasks by trial-and-error. Interacting with its environment, the agent receives a reward if it succeeds in solving a given task. Not least to speed up training by means of parallelization, simulation has become a key ingredient of modern RL, where agents are typically trained in virtual environments, i.e., simulation models of the physical world. Though (computer) games are classical benchmarks, in which DeepMind&#x2019;s AlphaGo and AlphaZero models outperformed humans (see Section <xref ref-type="sec" rid="s1">1</xref>, Figure <xref ref-type="fig" rid="fig-2">2</xref>), deep RL has proven capable of dealing with real-world applications in the field of control and robotics, see, e.g., [<xref ref-type="bibr" rid="ref-258">258</xref>]. 
Based on PyTorch&#x2019;s introductory tutorial (see <ext-link ext-link-type="uri" xlink:href="https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html">Original website</ext-link>) of the classic cart-pole problem, i.e., an inverted pendulum (pole) mounted to a moving base (cart), we developed an RL-model for the control of large-deformable beams, see Figure <xref ref-type="fig" rid="fig-92">92</xref> and the <ext-link ext-link-type="uri" xlink:href="https://gitlab.com/alexander.humer/cmes-dl-review/-/blob/main/rl-flexible-beam/rl-flexible-beam.mp4">video illustrating the training progress</ext-link>. For some large-deformable beam formulations, see, e.g., [<xref ref-type="bibr" rid="ref-259">259</xref>], [<xref ref-type="bibr" rid="ref-260">260</xref>], [<xref ref-type="bibr" rid="ref-261">261</xref>], [<xref ref-type="bibr" rid="ref-262">262</xref>].&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<fig id="fig-92">
<label>Figure 92</label>
<caption><title><italic>Positioning and pointing control of large deformable beam</italic> (Section <xref ref-type="sec" rid="s9">9</xref>, Remark <xref ref-type="statement" rid="st9_1">9.1</xref>). Reinforcement learning. The agent is trained to align the tip of the flexible beam with the target position (red ball). For this purpose, the agent can move the base of the cantilever; the environment returns the negative Euclidean distance of the beam&#x2019;s tip to the target position as &#x201C;reward&#x201D; in each time-step of the simulation. <ext-link ext-link-type="uri" xlink:href="https://www.techscience.com/uploads/video/cmes/2022/reinforcement%20learning.mp4">Simulation video</ext-link>. See also <ext-link ext-link-type="uri" xlink:href="https://gitlab.com/alexander.humer/cmes-dl-review">GitLab</ext-link> repository for code and video.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-92.tif"/>
</fig>
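<p>The reward described in the caption of Figure <xref ref-type="fig" rid="fig-92">92</xref>, i.e., the negative distance of the tip to the target, can be tried out on a deliberately simplified positioning task. The following sketch is not the model of Remark <xref ref-type="statement" rid="st9_1">9.1</xref>: the flexible beam is replaced by a rigid stand-in whose tip coincides with the base, the base moves on a one-dimensional grid, and tabular Q-learning is used in place of the deep Q-network of the PyTorch tutorial. All names, the grid size and the hyperparameters are illustrative assumptions.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random

N_POS, TARGET = 11, 7      # grid of base positions and target cell (illustrative)
ACTIONS = (-1, 0, 1)       # move the base left, hold it, move it right
TIP_OFFSET = 0             # rigid stand-in: a beam model would give the tip position


def step(pos, action):
    """Move the base, clip it to the grid, return new position and reward."""
    new_pos = min(max(pos + action, 0), N_POS - 1)
    reward = -abs((new_pos + TIP_OFFSET) - TARGET)   # negative tip-target distance
    return new_pos, reward


def greedy_action(Q, pos):
    """Index of the action with the largest estimated return in state pos."""
    return max(range(len(ACTIONS)), key=lambda k: Q[pos][k])


def train(episodes=2000, steps=20, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy and random start states."""
    rng = random.Random(seed)
    Q = [[0.0] * len(ACTIONS) for _ in range(N_POS)]
    for _ in range(episodes):
        pos = rng.randrange(N_POS)
        for _ in range(steps):
            if rng.random() < eps:                   # explore
                a = rng.randrange(len(ACTIONS))
            else:                                    # exploit current estimate
                a = greedy_action(Q, pos)
            new_pos, r = step(pos, ACTIONS[a])
            # Temporal-difference update toward r + gamma * max_a' Q(s', a').
            Q[pos][a] += alpha * (r + gamma * max(Q[new_pos]) - Q[pos][a])
            pos = new_pos
    return Q


def greedy_rollout(Q, pos, steps=15):
    """Follow the learned greedy policy and return the final base position."""
    for _ in range(steps):
        pos, _reward = step(pos, ACTIONS[greedy_action(Q, pos)])
    return pos


Q = train()
final_positions = [greedy_rollout(Q, start) for start in range(N_POS)]
```
]]></preformat>
<p>After training, following the greedy policy from any start cell is expected to bring the base to the target and hold it there. In the setting of Figure <xref ref-type="fig" rid="fig-92">92</xref>, the function <monospace>step</monospace> would be replaced by a time step of the beam simulation and the table <monospace>Q</monospace> by a neural network.</p>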
</sec>
<sec id="s9_3"><label>9.3</label>
<title>JAX</title>
<p>JAX [<xref ref-type="bibr" rid="ref-263">263</xref>] is a free and open-source research project driven by Google. Released to the public in 2018, JAX is one of the more recent software frameworks that have emerged during the current wave of AI. It is described as &#x201C;Autograd and XLA&#x201D; and as &#x201C;a language for expressing and composing transformations of numerical programs&#x201D;, i.e., JAX focuses on accelerating evaluations of algebraic expressions and, in particular, gradient computations. As a matter of fact, its core API, which provides a mostly NumPy-compatible interface to many mathematical operations, is rather trimmed-down in terms of DL-specific functions as compared to the broad scope of functionality offered by TensorFlow and PyTorch, for instance, which, for this reason, are often referred to as <italic>end-to-end</italic> frameworks. JAX, on the other hand, considers itself a system that facilitates &#x201C;transformations&#x201D; like gradient computation, just-in-time compilation and automatic vectorization of compositions of functions on parallel hardware such as GPUs and TPUs. A higher-level interface to JAX&#x2019;s functionality, which is specifically made for ML-purposes, is available through the <italic>FLAX</italic> framework [<xref ref-type="bibr" rid="ref-264">264</xref>]. FLAX provides many fundamental building blocks essential for creation and training of neural networks.</p>
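<p>The idea of composable transformations can be mimicked in a few lines of plain Python. In JAX itself, the transformations are exposed as <monospace>jax.grad</monospace>, <monospace>jax.jit</monospace> and <monospace>jax.vmap</monospace>; the stand-ins <monospace>grad</monospace> and <monospace>vmap</monospace> below only imitate how such function-to-function maps compose. They rely on forward-mode dual numbers and a sequential loop, which is an assumption of this sketch and not how JAX works internally, and no compilation or parallel execution takes place.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


class Dual:
    """Number val + dot * eps with eps^2 = 0; components may be Dual themselves."""

    def __init__(self, val, dot):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

    __radd__ = __add__
    __rmul__ = __mul__


def sin(x):
    """Sine for floats and (nested) dual numbers."""
    if isinstance(x, Dual):
        return Dual(sin(x.val), cos(x.val) * x.dot)
    return math.sin(x)


def cos(x):
    """Cosine for floats and (nested) dual numbers."""
    if isinstance(x, Dual):
        return Dual(cos(x.val), -1.0 * sin(x.val) * x.dot)
    return math.cos(x)


def grad(f):
    """Transformation: scalar function f -> function evaluating df/dx."""
    def df(x):
        out = f(Dual(x, 1.0))
        return out.dot if isinstance(out, Dual) else 0.0
    return df


def vmap(f):
    """Transformation: scalar function f -> function mapping f over a list."""
    return lambda xs: [f(x) for x in xs]


def f(x):
    return x * sin(x)


df = grad(f)                      # first derivative:  sin(x) + x cos(x)
d2f = grad(grad(f))               # second derivative: 2 cos(x) - x sin(x)
batched_df = vmap(grad(f))        # derivative evaluated for a batch of inputs
values = batched_df([0.0, 1.0, 2.0])
```
]]></preformat>
<p>Because each transformation returns an ordinary function, transformations can be nested in any order, which is the property the quoted description of JAX emphasizes.</p>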
</sec>
<sec id="s9_4"><label>9.4</label>
<title>Leveraging DL-frameworks for scientific computing</title>
<p>Software frameworks for deep-learning such as PyTorch, TensorFlow and JAX share several features which are also essential in scientific computing, in general, and finite-element analysis, in particular. These DL-frameworks are highly optimized in terms of vectorization and parallelization of algebraic operations. Within finite-element methods, parallel evaluations can be exploited in several respects: First and foremost, residual vectors and (tangent) stiffness matrices need to be repeatedly evaluated for all elements of the finite-element mesh, into which the domain of interest is discretized. Secondly, the computation of each of these vectors and matrices is based upon numerical quadrature (see Section <xref ref-type="sec" rid="s10_3_2">10.3.2</xref> for a DL-based approach to improve quadrature), which, from an algorithmic point of view, is computed as a weighted sum of integrands evaluated at a finite set of points. A further key component that proves advantageous in complex FE-problems is automatic differentiation, which, in the form of backpropagation (i.e., reverse-mode automatic differentiation), is the backbone of gradient-based training of neural networks, see Section <xref ref-type="sec" rid="s5">5</xref>. In the context of FE-problems in solid mechanics, automatic differentiation saves us from deriving and implementing derivatives of potentials, whose variation and linearization with respect to (generalized) coordinates give force vectors and tangent-stiffness matrices.<xref ref-type="fn" rid="fn250"><sup>250</sup></xref><fn id="fn250"><label>250</label><p>GPU-computing and automatic differentiation are by no means new to scientific computing, not least in the field of computational mechanics. <italic>Project Chrono</italic> (see <ext-link ext-link-type="uri" xlink:href="https://www.projectchrono.org">Original website</ext-link>), for instance, is well known for its support of GPU-computing in problems of flexible multi-body and particle systems. 
The general purpose finite-element code <italic>Netgen/NGSolve</italic> [<xref ref-type="bibr" rid="ref-265">265</xref>] (see <ext-link ext-link-type="uri" xlink:href="https://www.ngsolve.org">Original website</ext-link>) offers a great degree of flexibility owing to its automatic differentiation capabilities. Well-established commercial codes, on the other hand, are often built on a comparatively old codebase, which dates back to times before the advent of GPU-computing.</p></fn></p>
<p>The potential of modern DL-software frameworks in conventional finite-element problems was studied in [<xref ref-type="bibr" rid="ref-266">266</xref>], where Netgen/NGSolve, which is a highly-optimized, OpenMP-parallel finite-element code written in C++, was compared against PyTorch and JAX implementations. In particular, the computational efficiency of computing and assembling vectors of internal forces and tangent stiffness matrices of a hyperelastic solid was investigated. On the same (virtual) machine, it turned out that both PyTorch and JAX can compete with Netgen/NGSolve when computations are performed on CPU, see the timings shown in Figure <xref ref-type="fig" rid="fig-93">93</xref>. When computations were moved to a GPU, the Python-based DL-frameworks outperformed Netgen/NGSolve in the evaluation of residual vectors. Regarding tangent-stiffness matrices, which are obtained through (automatic) second derivatives of the strain-energy function with respect to nodal coordinates, both PyTorch and JAX showed (different) bottlenecks, which, however, are likely to be sorted out in future releases.</p>
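<p>The two levels of repetition mentioned in this section, namely the loop over elements and the weighted sum over quadrature points, can be seen in the following minimal sketch, which assembles the internal-force vector of a one-dimensional bar with two-node elements. It is not the implementation studied in [<xref ref-type="bibr" rid="ref-266">266</xref>]: the linear-elastic stress law, the two-point Gauss-Legendre rule, the function names and the example mesh are assumptions made here, and plain Python loops stand in for the vectorized array operations that a DL-framework would apply across all elements and quadrature points at once.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math

# Two-point Gauss-Legendre rule on the reference interval [-1, 1].
GAUSS_POINTS = (-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0))
GAUSS_WEIGHTS = (1.0, 1.0)


def stress(strain, youngs_modulus=1.0):
    """Linear-elastic stress; a nonlinear material law could be swapped in."""
    return youngs_modulus * strain


def internal_forces(nodes, displacements, area=lambda x: 1.0):
    """Assemble the internal-force vector of a 1D bar with two-node elements."""
    forces = [0.0] * len(nodes)
    for e in range(len(nodes) - 1):                   # loop over all elements
        x1, x2 = nodes[e], nodes[e + 1]
        length = x2 - x1
        B = (-1.0 / length, 1.0 / length)             # shape-function derivatives
        strain = B[0] * displacements[e] + B[1] * displacements[e + 1]
        jacobian = 0.5 * length                       # dx/dxi of the element map
        for xi, weight in zip(GAUSS_POINTS, GAUSS_WEIGHTS):
            x = 0.5 * (x1 + x2) + jacobian * xi       # physical quadrature point
            integrand = stress(strain) * area(x)      # axial force at that point
            for a in range(2):                        # weighted sum per local node
                forces[e + a] += weight * B[a] * integrand * jacobian
    return forces


# Non-uniform mesh under a uniform strain of 0.1 with unit cross-section.
nodes = [0.0, 0.5, 1.0, 2.0]
u = [0.1 * x for x in nodes]
f_int = internal_forces(nodes, u)
```
]]></preformat>
<p>For a uniform strain and a constant cross-section, the contributions of neighboring elements cancel at interior nodes, so that only the two end nodes carry a non-zero internal force. Passing a position-dependent <monospace>area</monospace> function makes the integrand vary within each element, in which case the quadrature points matter.</p>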
</sec>
<sec id="s9_5"><label>9.5</label>
<title>Physics-Informed Neural Network (PINN) frameworks</title>
<p>In laying out the roadmap for &#x201C;Simulation Intelligence&#x201D; (SI) the authors of [<xref ref-type="bibr" rid="ref-267">267</xref>] considered PINN as a key player in the first of the nine SI &#x201C;motifs,&#x201D; called &#x201C;Multi-physics &amp; multi-scale modeling.&#x201D;</p>
<p>The PINN method to solve differential equations (ODEs, PDEs) aims at training a neural network to minimize the total weighted loss <inline-formula id="ieqn-2049"><mml:math id="mml-ieqn-2049"><mml:mi>L</mml:mi></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-94">94</xref>, which describes the PINN concept in more technical detail. As an example, the PDE-residual loss <inline-formula id="ieqn-2050"><mml:math id="mml-ieqn-2050"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mtext>PDE</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub></mml:math></inline-formula> for incompressible flow can be written as follows [<xref ref-type="bibr" rid="ref-268">268</xref>]:</p>
<p><disp-formula id="eqn-381"><label>(381)</label><mml:math id="mml-eqn-381" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mtext>PDE</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>N</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:munderover><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-382"><label>(382)</label><mml:math id="mml-eqn-382" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>p</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>R</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mfrac><mml:msup><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi 
mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2051"><mml:math id="mml-ieqn-2051"><mml:msub><mml:mi>N</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula> is the number of residual (collocation) points, which could be in the millions generated randomly,<xref ref-type="fn" rid="fn251"><sup>251</sup></xref><fn id="fn251"><label>251</label><p>For incompressible flow past a cylinder, the computational domain of dimension <inline-formula id="ieqn-3300"><mml:math id="mml-ieqn-3300"><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>7.5</mml:mn><mml:mo>,</mml:mo><mml:mn>28.5</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>20</mml:mn><mml:mo>,</mml:mo><mml:mn>20</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>12.5</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>&#x2013;with coordinates <inline-formula id="ieqn-3301"><mml:math id="mml-ieqn-3301"><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> non-dimensionalized by the diameter of the cylinder, with axis along the <inline-formula id="ieqn-3302"><mml:math id="mml-ieqn-3302"><mml:mi>z</mml:mi></mml:math></inline-formula> direction, and going through the point <inline-formula id="ieqn-3303"><mml:math id="mml-ieqn-3303"><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>&#x2013;contained <inline-formula id="ieqn-3304"><mml:math 
id="mml-ieqn-3304"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mn>6</mml:mn></mml:msup></mml:math></inline-formula> residual (collocation) points [<xref ref-type="bibr" rid="ref-268">268</xref>].</p></fn> the residual <inline-formula id="ieqn-2052"><mml:math id="mml-ieqn-2052"><mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-382">382</xref>)<inline-formula id="ieqn-2053"><mml:math id="mml-ieqn-2053"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> is the left-hand side of the balance of linear momentum (the right-hand side being zero), and the residual <inline-formula id="ieqn-2054"><mml:math id="mml-ieqn-2054"><mml:msub><mml:mi>r</mml:mi><mml:mn>4</mml:mn></mml:msub></mml:math></inline-formula> in Eq. 
(<xref ref-type="disp-formula" rid="eqn-382">382</xref>)<inline-formula id="ieqn-2055"><mml:math id="mml-ieqn-2055"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> is the left-hand side of the incompressibility constraint, and <inline-formula id="ieqn-2056"><mml:math id="mml-ieqn-2056"><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> are the space-time coordinates of the collocation point <inline-formula id="ieqn-2057"><mml:math id="mml-ieqn-2057"><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
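<p>As a minimal sketch of how Eq. (<xref ref-type="disp-formula" rid="eqn-381">381</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-382">382</xref>) fit together, the plain-Python example below evaluates the PDE loss at randomly generated collocation points. It rests on assumptions that are not taken from [<xref ref-type="bibr" rid="ref-268">268</xref>]: a Taylor-Green vortex, a known exact solution of the incompressible Navier-Stokes equations, stands in for the network output; central finite differences stand in for automatic differentiation; and the names <monospace>taylor_green</monospace>, <monospace>residuals</monospace>, <monospace>pde_loss</monospace>, as well as the values of <monospace>RE</monospace> and <monospace>H</monospace>, are hypothetical choices made for illustration.</p>
<code language="python">
```python
import math
import random

RE = 100.0  # Reynolds number (illustrative value, an assumption)
H = 1e-3    # finite-difference step standing in for automatic differentiation


def taylor_green(x, y, z, t, scale=1.0):
    """Return (u, v, w, p) of a Taylor-Green vortex; scale=1 is the exact solution."""
    decay = math.exp(-2.0 * t / RE)
    u = scale * math.cos(x) * math.sin(y) * decay
    v = -scale * math.sin(x) * math.cos(y) * decay
    w = 0.0
    p = -0.25 * (math.cos(2.0 * x) + math.cos(2.0 * y)) * decay ** 2
    return (u, v, w, p)


def shifted(point, axis, delta):
    """Copy of the space-time point (x, y, z, t) moved by delta along one axis."""
    moved = list(point)
    moved[axis] += delta
    return moved


def d1(field, point, axis, comp):
    """First derivative of component comp along axis (central difference)."""
    plus = field(*shifted(point, axis, H))[comp]
    minus = field(*shifted(point, axis, -H))[comp]
    return (plus - minus) / (2.0 * H)


def d2(field, point, axis, comp):
    """Second derivative of component comp along axis (central difference)."""
    plus = field(*shifted(point, axis, H))[comp]
    minus = field(*shifted(point, axis, -H))[comp]
    return (plus - 2.0 * field(*point)[comp] + minus) / H ** 2


def residuals(field, point):
    """Residuals r_1..r_4 of Eq. (382) at one collocation point."""
    velocity = field(*point)[:3]
    r = []
    for c in range(3):  # momentum balance: components r_1, r_2, r_3
        du_dt = d1(field, point, 3, c)
        convection = sum(velocity[a] * d1(field, point, a, c) for a in range(3))
        dp = d1(field, point, c, 3)
        laplacian = sum(d2(field, point, a, c) for a in range(3))
        r.append(du_dt + convection + dp - laplacian / RE)
    r.append(sum(d1(field, point, a, a) for a in range(3)))  # r_4: divergence
    return r


def pde_loss(field, points):
    """Mean over collocation points of the summed squared residuals, Eq. (381)."""
    total = sum(rj ** 2 for pt in points for rj in residuals(field, pt))
    return total / len(points)


random.seed(0)
points = [
    tuple(random.uniform(0.0, 2.0 * math.pi) for _ in range(3))
    + (random.uniform(0.0, 1.0),)
    for _ in range(200)
]

loss_exact = pde_loss(taylor_green, points)
loss_wrong = pde_loss(lambda x, y, z, t: taylor_green(x, y, z, t, scale=1.5), points)
print(f"L_PDE, exact field:     {loss_exact:.3e}")
print(f"L_PDE, perturbed field: {loss_wrong:.3e}")
```
</code>
<p>The exact field leaves only the finite-difference truncation error in the loss, whereas scaling the velocity by 1.5 violates the momentum balance and produces a loss that is many orders of magnitude larger. In a PINN, the field would be the network output, and this difference is what the optimizer drives down.</p>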
<fig id="fig-93">
<label>Figure 93</label>
<caption><title><italic>DL-frameworks in nonlinear finite-element problems</italic> (Section <xref ref-type="sec" rid="s9_4">9.4</xref>). The computational efficiency of a finite-element code implemented in PyTorch (Version 1.8) was compared against the state-of-the-art general-purpose Netgen/NGSolve [<xref ref-type="bibr" rid="ref-265">265</xref>] for a problem of nonlinear elasticity; see the <ext-link ext-link-type="uri" xlink:href="https://gitlab.com/alexander.humer/cmes-dl-review/-/blob/main/asme_idetc_msndc_2021/presentation_idetc.pdf.tif">slides of the presentation</ext-link> and the corresponding <ext-link ext-link-type="uri" xlink:href="https://gitlab.com/alexander.humer/cmes-dl-review/-/blob/main/asme_idetc_msndc_2021/humer_idetc.mp4.tif">video</ext-link>. The figures show timings (in seconds) for evaluations of the strain energy (top left), the internal forces (residual, top right), the element-stiffness matrices (bottom left), and the global stiffness matrix (bottom right) against the number of elements. Owing to PyTorch&#x2019;s parallel computation capacity, the simple Python implementation could compete with the highly optimized finite-element code, in particular once computations were moved to a GPU (NVIDIA Tesla V100).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-93.tif"/>
</fig>
<p>Review papers on PINN include [<xref ref-type="bibr" rid="ref-269">269</xref>] [<xref ref-type="bibr" rid="ref-270">270</xref>], with the latter, which touches on many different fields, being more general than [<xref ref-type="bibr" rid="ref-268">268</xref>], which was restricted to fluid mechanics. Table <xref ref-type="table" rid="table-6">6</xref> lists PINN frameworks that are currently actively developed; a few selected <italic>solvers</italic> among them are summarized below.</p>
<list list-type="simple">
<list-item><label>&#x261B;</label><p><ext-link ext-link-type="uri" xlink:href="https://deepxde.readthedocs.io/en/latest/">DeepXDE</ext-link> [<xref ref-type="bibr" rid="ref-271">271</xref>], one of the first PINN frameworks and a <italic>solver</italic> (Table <xref ref-type="table" rid="table-6">6</xref>), was developed in Python, with a TensorFlow backend, for both teaching and research. This framework can solve both forward problems (&#x201C;given initial and boundary conditions&#x201D;) and inverse problems (&#x201C;given some extra measures&#x201D;), with domains having complex geometry. According to the authors, DeepXDE is user-friendly, with compact user code resembling the mathematical formulation of the problem, and customizable to different types of mechanics problems. The site lists many published papers and provides a large number of demo problems: Poisson equation, Burgers equation, diffusion-reaction equation, wave propagation equation, fractional PDEs, etc. In addition, there are demos on inverse problems and operator learning. Three more backends beyond TensorFlow, the backend reported in [<xref ref-type="bibr" rid="ref-270">270</xref>], have been added to DeepXDE: PyTorch, JAX, Paddle.<xref ref-type="fn" rid="fn252"><sup>252</sup></xref><fn id="fn252"><label>252</label><p>See Section <ext-link ext-link-type="uri" xlink:href="https://deepxde.readthedocs.io/en/latest/user/installation.html#working-with-different-backends">Working with different backends</ext-link>. See also [<xref ref-type="bibr" rid="ref-269">269</xref>].</p></fn></p></list-item>
<list-item><label>&#x261B;</label><p><ext-link ext-link-type="uri" xlink:href="https://github.com/NeuroDiffGym/neurodiffeq">NeuroDiffEq</ext-link> [<xref ref-type="bibr" rid="ref-274">274</xref>], a <italic>solver</italic>, was developed about the same time as DeepXDE, with the backend being PyTorch. Even though it was written that the authors were &#x201C;actively working on extending NeuroDiffEq to support three spatial dimensions,&#x201D; this feature is not ready; the limitation can be worked around by including the 3D boundary conditions in the loss function.<xref ref-type="fn" rid="fn253"><sup>253</sup></xref><fn id="fn253"><label>253</label><p>&#x201C;All you need is to import <monospace>GenericSolver</monospace> from <monospace>neurodiffeq.solvers</monospace>, and <monospace>Generator3D</monospace> from <monospace>neurodiffeq.generators</monospace>. The catch is that currently there is no reparametrization defined in <monospace>neurodiffeq.conditions</monospace> to satisfy 3D boundary conditions,&#x201D; which can be hacked into the loss function &#x201C;by either adding another element in your equation system or overwriting the <monospace>additional_loss</monospace> method of <monospace>GenericSolve</monospace>.&#x201D; Private communication with a developer of NeuroDiffEq on 2022.10.08.</p></fn> Even though NeuroDiffEq can, in principle, be used to solve PDEs of interest to engineering (e.g., the Navier-Stokes equations), there were no such examples in the official documentation, except for a 2D Laplace equation and a 1D heat equation. The site did not list any papers, either by the developers or by others, using this framework.</p></list-item>
<list-item><label>&#x261B;</label><p><ext-link ext-link-type="uri" xlink:href="https://neuralpde.sciml.ai/dev/">NeuralPDE</ext-link> [<xref ref-type="bibr" rid="ref-275">275</xref>], a <italic>solver</italic>, was developed in <ext-link ext-link-type="uri" xlink:href="https://julialang.org/">Julia</ext-link>, a relatively new language that is 20 years younger than Python and has a speed edge over Python in machine learning, but not in data science.<xref ref-type="fn" rid="fn254"><sup>254</sup></xref><fn id="fn254"><label>254</label><p>There are many comparisons of Julia versus Python on the web; one is <ext-link ext-link-type="uri" xlink:href="https://blog.boot.dev/python/python-vs-julia/">Julia vs Python: Which is Best to Learn First?</ext-link>, by Zulie Rane on 2022.02.05, updated on 2022.10.01.</p></fn> Demos are given for ODEs and generic PDEs, such as the coupled nonlinear hyperbolic PDEs of the form:</p></list-item></list>
<table-wrap id="table-6"><label>Table 6</label>
<caption>
<p><italic>PINN frameworks</italic> (<xref ref-type="sec" rid="s9_5">Section 9.5</xref>) being actively developed [<xref ref-type="bibr" rid="ref-270">270</xref>] [<xref ref-type="bibr" rid="ref-269">269</xref>]. A <italic>solver</italic> solves the problem defined by users. A <italic>wrapper</italic> does not solve, but only wraps low-level functions from other libraries (e.g., PyTorch) into high-level functions that are convenient for users to implement PINN to solve the problem.</p>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th align="left">Framework site &#x0026; Ref.</th>
<th align="left">Usage</th>
<th align="left">Language</th>
<th align="left">Backend</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"><ext-link ext-link-type="uri" xlink:href="https://deepxde.readthedocs.io/en/latest/">DeepXDE</ext-link> [<xref ref-type="bibr" rid="ref-271">271</xref>]</td>
<td align="left">Solver</td>
<td align="left">Python</td>
<td align="left">TensorFlow, PyTorch, JAX, Paddle</td>
</tr>
<tr>
<td align="left"><ext-link ext-link-type="uri" xlink:href="https://developer.nvidia.com/modulus">NVIDIA Modulus (SimNet)</ext-link> [<xref ref-type="bibr" rid="ref-272">272</xref>]</td>
<td align="left">Solver</td>
<td align="left">Python</td>
<td align="left">TensorFlow</td>
</tr>
<tr>
<td align="left"><ext-link ext-link-type="uri" xlink:href="https://github.com/analysiscenter/pydens">PyDEns</ext-link> [<xref ref-type="bibr" rid="ref-273">273</xref>]</td>
<td align="left">Solver</td>
<td align="left">Python</td>
<td align="left">TensorFlow</td>
</tr>
<tr>
<td align="left"><ext-link ext-link-type="uri" xlink:href="https://github.com/NeuroDiffGym/neurodiffeq">NeuroDiffEq</ext-link> [<xref ref-type="bibr" rid="ref-274">274</xref>]</td>
<td align="left">Solver</td>
<td align="left">Python</td>
<td align="left">PyTorch</td>
</tr>
<tr>
<td align="left"><ext-link ext-link-type="uri" xlink:href="https://neuralpde.sciml.ai/dev/">NeuralPDE</ext-link> [<xref ref-type="bibr" rid="ref-275">275</xref>]</td>
<td align="left">Solver</td>
<td align="left">Julia</td>
<td align="left">Julia</td>
</tr>
<tr>
<td align="left"><ext-link ext-link-type="uri" xlink:href="https://www.sciann.com/">SciANN</ext-link> [<xref ref-type="bibr" rid="ref-276">276</xref>]</td>
<td align="left">Wrapper</td>
<td align="left">Python</td>
<td align="left">TensorFlow</td>
</tr>
<tr>
<td align="left"><ext-link ext-link-type="uri" xlink:href="https://kailaix.github.io/ADCME.jl/latest/">ADCME</ext-link> [<xref ref-type="bibr" rid="ref-277">277</xref>]</td>
<td align="left">Wrapper</td>
<td align="left">Julia</td>
<td align="left">TensorFlow</td>
</tr>
<tr>
<td align="left"><ext-link ext-link-type="uri" xlink:href="https://gpytorch.ai/">GPytorch</ext-link> [<xref ref-type="bibr" rid="ref-278">278</xref>]</td>
<td align="left">Wrapper</td>
<td align="left">Python</td>
<td align="left">PyTorch</td>
</tr>
<tr>
<td align="left"><ext-link ext-link-type="uri" xlink:href="https://github.com/google/neural-tangents">Neural Tangents</ext-link> [<xref ref-type="bibr" rid="ref-279">279</xref>]</td>
<td align="left">Wrapper</td>
<td align="left">Python</td>
<td align="left">JAX</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="fig-94">
<label>Figure 94</label>
<caption><title><italic>Physics-Informed Neural Networks (PINN) concept</italic> (Section <xref ref-type="sec" rid="s9_5">9.5</xref>). The goal is to find the optimal network parameters <inline-formula id="ieqn-1927"><mml:math id="mml-ieqn-1927"><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula> (weights) and PDE parameters <inline-formula id="ieqn-1928"><mml:math id="mml-ieqn-1928"><mml:msup><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:math></inline-formula> that minimize the total weighted loss function <inline-formula id="ieqn-1929"><mml:math id="mml-ieqn-1929"><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which is a linear combination of four loss functions: (1) The residual of the PDE, <inline-formula id="ieqn-1930"><mml:math id="mml-ieqn-1930"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mtext>PDE</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub></mml:math></inline-formula>, (2) Loss due to initial conditions, <inline-formula id="ieqn-1931"><mml:math id="mml-ieqn-1931"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mtext>IC</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub></mml:math></inline-formula>, (3) Loss due to boundary conditions, <inline-formula id="ieqn-1932"><mml:math id="mml-ieqn-1932"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mtext>BC</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub></mml:math></inline-formula>, (4) Loss due to known (labeled) data, <inline-formula id="ieqn-1933"><mml:math id="mml-ieqn-1933"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mstyle displaystyle="false" 
scriptlevel="2"><mml:mrow><mml:mtext>data</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub></mml:math></inline-formula>, with <inline-formula id="ieqn-1934"><mml:math id="mml-ieqn-1934"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mtext>P</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mtext>B</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:msub></mml:math></inline-formula> being the combination coefficients. With the space-time coordinates <inline-formula id="ieqn-1935"><mml:math id="mml-ieqn-1935"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as inputs, the neural network produces an approximated multi-physics solution <inline-formula id="ieqn-1936"><mml:math id="mml-ieqn-1936"><mml:mrow><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo>&#x007B;</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo>,</mml:mo><mml:mi>p</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo>&#x007D;</mml:mo></mml:mrow></mml:math></inline-formula>, of which the derivatives, estimated by automatic differentiation, are used to evaluate the loss functions. 
If the total loss <inline-formula id="ieqn-1937"><mml:math id="mml-ieqn-1937"><mml:mi>L</mml:mi></mml:math></inline-formula> is not less than a tolerance, its gradients with respect to the parameters <inline-formula id="ieqn-1938"><mml:math id="mml-ieqn-1938"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> are used to update these parameters in a descent direction toward a local minimum of <inline-formula id="ieqn-1939"><mml:math id="mml-ieqn-1939"><mml:mi>L</mml:mi></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-268">268</xref>].</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-94.tif"/>
</fig>
<fig id="fig-95">
<label>Figure 95</label>
<caption><title>Coupled nonlinear hyperbolic equations (Section <xref ref-type="sec" rid="s9_5">9.5</xref>). Analytical solution, predicted solution by <ext-link ext-link-type="uri" xlink:href="https://neuralpde.sciml.ai/dev/">NeuralPDE</ext-link> [<xref ref-type="bibr" rid="ref-275">275</xref>] and error for the coupled nonlinear hyperbolic equations in Eq. (<xref ref-type="disp-formula" rid="eqn-383">383</xref>).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-95.tif"/>
</fig>
<p><disp-formula id="eqn-383"><label>(383)</label><mml:math id="mml-eqn-383" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mfrac><mml:mrow><mml:msup><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>t</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mi>a</mml:mi><mml:mi>n</mml:mi></mml:mfrac><mml:mfrac><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:mfrac><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:mfrac><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>u</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:mi>u</mml:mi><mml:mi>w</mml:mi></mml:mfrac><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mfrac><mml:mrow><mml:msup><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>t</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mi>b</mml:mi><mml:mi>n</mml:mi></mml:mfrac><mml:mfrac><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mi 
mathvariant="normal">&#x2202;</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:mfrac><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:mfrac><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>w</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:mi>u</mml:mi><mml:mi>w</mml:mi></mml:mfrac><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-2058"><mml:math id="mml-ieqn-2058"><mml:mi>f</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-2059"><mml:math id="mml-ieqn-2059"><mml:mi>g</mml:mi></mml:math></inline-formula> being arbitrary functions. Initial and boundary conditions are prescribed, and an exact solution is available to evaluate the error of the numerical solution, Figure <xref ref-type="fig" rid="fig-95">95</xref>. The site did not list any papers, either by the developers or by others, using this framework.</p>
<p>Additional PINN software packages other than those in Table <xref ref-type="table" rid="table-6">6</xref> are listed and summarized in [<xref ref-type="bibr" rid="ref-269">269</xref>].</p> 
<statement id="st9_2"><title>Remark 9.2.</title>
<p><italic>PINN and activation functions</italic>. Deep neural networks (DNN), having at least two hidden layers, with ReLU activation function (Figure <xref ref-type="fig" rid="fig-24">24</xref>) were shown to correspond to linear finite element interpolation [<xref ref-type="bibr" rid="ref-280">280</xref>], since piecewise linear functions can be written as DNN with ReLU activation functions [<xref ref-type="bibr" rid="ref-281">281</xref>].</p>
<p>When the <italic>strong</italic> form is used, however, such as the PDE in Eq. (<xref ref-type="disp-formula" rid="eqn-383">383</xref>), which contains second partial derivatives with respect to <inline-formula id="ieqn-2060"><mml:math id="mml-ieqn-2060"><mml:mi>x</mml:mi></mml:math></inline-formula>, ReLU is unsuitable since its second derivative is zero; activation functions such as the logistic sigmoid (Figure <xref ref-type="fig" rid="fig-30">30</xref>), hyperbolic tangent (Figure <xref ref-type="fig" rid="fig-31">31</xref>), or the swish function (Figure <xref ref-type="fig" rid="fig-139">139</xref>) are recommended instead. Second partial derivatives with respect to the spatial coordinates are also present in the general PDE Eq. (<xref ref-type="disp-formula" rid="eqn-384">384</xref>), particularized to the 2D Navier-Stokes Eq. (<xref ref-type="disp-formula" rid="eqn-385">385</xref>):</p>
<p><disp-formula id="eqn-384"><label>(384)</label><mml:math id="mml-eqn-384" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BB;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;with</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-385"><label>(385)</label><mml:math id="mml-eqn-385" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BB;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:msub><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi>v</mml:mi><mml:msub><mml:mi>u</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>y</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo 
stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:msub><mml:mi>v</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi>v</mml:mi><mml:msub><mml:mi>v</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>y</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mrow><mml:mi>&#x03BB;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>for which the hyperbolic tangent (<inline-formula id="ieqn-2061"><mml:math id="mml-ieqn-2061"><mml:mrow><mml:mtext>tanh</mml:mtext></mml:mrow></mml:math></inline-formula>) was used as activation function [<xref ref-type="bibr" rid="ref-282">282</xref>].&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement> 
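<p>The vanishing second derivative of ReLU noted in Remark <xref ref-type="statement" rid="st9_2">9.2</xref> can be checked numerically. The sketch below is an illustration under stated assumptions, not the setup of any cited work: a one-hidden-layer network with three neurons and hand-picked, hypothetical parameters (<monospace>WEIGHTS</monospace>, <monospace>BIASES</monospace>, <monospace>COEFFS</monospace>) is differentiated twice with a central second difference, which stands in for automatic differentiation, at points chosen away from the ReLU kinks.</p>
<code language="python">
```python
import math

# Hypothetical parameters of a network u(x) = sum_k c_k * act(w_k * x + b_k).
WEIGHTS = [1.0, -2.0, 0.5]
BIASES = [0.3, 0.1, -0.7]
COEFFS = [0.8, -0.4, 1.5]
H = 1e-3  # step of the central second difference


def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def network(x, act):
    """Output of the one-hidden-layer network for a scalar input x."""
    return sum(c * act(w * x + b) for w, b, c in zip(WEIGHTS, BIASES, COEFFS))


def second_derivative(x, act):
    """Approximate u''(x) by a central second difference."""
    return (network(x + H, act) - 2.0 * network(x, act) + network(x - H, act)) / H ** 2


# Sample points chosen away from the ReLU kinks at x = -b_k / w_k
# (here at x = -0.3, 0.05, and 1.4).
xs = [-0.8, -0.5, 0.2, 0.6, 0.9]
relu_dd = [second_derivative(x, relu) for x in xs]
tanh_dd = [second_derivative(x, math.tanh) for x in xs]

print("max |u''| with ReLU:", max(abs(v) for v in relu_dd))
print("max |u''| with tanh:", max(abs(v) for v in tanh_dd))
```
</code>
<p>With ReLU, the second derivative is zero up to round-off at every sample point, so the second-order terms of a strong-form residual would not respond to changes in the network parameters there; with tanh, the second derivative is of order one and can be adjusted by training.</p>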
<statement id="st9_3"><title>Remark 9.3.</title>
<p><italic>Variational PINN</italic>. In the finite element method, the <italic>weak</italic> form of the PDE, not the strong form as in Remark <xref ref-type="statement" rid="st9_2">9.2</xref>, reduces the differentiability requirement on the trial solution, and is discretized with numerical integration used to evaluate the resulting coefficients of various matrices (e.g., mass, stiffness, etc.). Similarly, PINN can be formulated using the <italic>weak</italic> form, instead of the strong form such as Eq. (<xref ref-type="disp-formula" rid="eqn-385">385</xref>), at the expense of having to perform numerical integration (quadrature) [<xref ref-type="bibr" rid="ref-283">283</xref>] [<xref ref-type="bibr" rid="ref-284">284</xref>] [<xref ref-type="bibr" rid="ref-285">285</xref>].</p>
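<p>A weak-form loss evaluated by quadrature can be sketched as follows for a 1-D model problem, assumed here to be &#x2212;<italic>u</italic>&#x2033; = <italic>f</italic> on (&#x2212;1, 1) with zero end values. Each weak residual is the integral of <italic>u</italic>&#x2032;<italic>v</italic>&#x2032; minus the integral of <italic>f v</italic> for a test function <italic>v</italic> that vanishes at both ends, so only the first derivative of the trial solution is needed. The sine test functions, the composite two-point Gauss-Legendre rule, the manufactured load, and all names in the code are assumptions made for illustration, and are not necessarily the choices made in [<xref ref-type="bibr" rid="ref-283">283</xref>]; closed-form trial derivatives stand in for a network output.</p>
<code language="python">
```python
import math

N_CELLS = 200  # quadrature cells on (-1, 1); illustrative value
N_TEST = 5     # number of test functions; illustrative value
GAUSS = 1.0 / math.sqrt(3.0)  # two-point Gauss-Legendre abscissa on [-1, 1]


def integrate(func):
    """Composite two-point Gauss-Legendre quadrature of func over (-1, 1)."""
    half = 1.0 / N_CELLS  # half-width of one cell
    total = 0.0
    for cell in range(N_CELLS):
        mid = -1.0 + (2 * cell + 1) * half
        total += half * (func(mid - half * GAUSS) + func(mid + half * GAUSS))
    return total


def source(x):
    """Load f(x) manufactured so that u(x) = sin(pi x) solves -u'' = f."""
    return math.pi ** 2 * math.sin(math.pi * x)


def v_k(k, x):
    """Test function number k, vanishing at x = -1 and x = 1."""
    return math.sin(0.5 * k * math.pi * (x + 1.0))


def v_k_dx(k, x):
    """Derivative of test function number k."""
    return 0.5 * k * math.pi * math.cos(0.5 * k * math.pi * (x + 1.0))


def weak_loss(trial_dx):
    """Sum over k of R_k^2, with R_k = int(u' v_k') - int(f v_k)."""
    loss = 0.0
    for k in range(1, N_TEST + 1):
        stiffness = integrate(lambda x: trial_dx(x) * v_k_dx(k, x))
        load = integrate(lambda x: source(x) * v_k(k, x))
        loss += (stiffness - load) ** 2
    return loss


def exact_dx(x):
    """Derivative of the exact solution sin(pi x)."""
    return math.pi * math.cos(math.pi * x)


def perturbed_dx(x):
    """Derivative of the perturbed trial sin(pi x) + 0.1 sin(2 pi x)."""
    return exact_dx(x) + 0.2 * math.pi * math.cos(2.0 * math.pi * x)


loss_exact = weak_loss(exact_dx)
loss_perturbed = weak_loss(perturbed_dx)
print(f"weak-form loss, exact trial:     {loss_exact:.3e}")
print(f"weak-form loss, perturbed trial: {loss_perturbed:.3e}")
```
</code>
<p>The exact trial solution leaves only quadrature error in the loss, while the perturbed trial is detected by the fourth test function. Because only first derivatives of the trial solution enter, the loss remains meaningful for trial functions whose second derivative vanishes or does not exist.</p>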
<p>Examples of 1-D PDEs were given in [<xref ref-type="bibr" rid="ref-283">283</xref>] in which the activation function was a sine function defined over the interval <inline-formula id="ieqn-2062"><mml:math id="mml-ieqn-2062"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. To illustrate the concept, consider the following simple 1-D problem [<xref ref-type="bibr" rid="ref-283">283</xref>] (which could model the axial displacement <inline-formula id="ieqn-2063"><mml:math id="mml-ieqn-2063"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of an elastic bar under distributed load <inline-formula id="ieqn-2064"><mml:math id="mml-ieqn-2064"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with prescribed end displacements):</p>
<p><disp-formula id="eqn-386"><label>(386)</label><mml:math id="mml-eqn-386" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>u</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2033;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>d</mml:mi><mml:mi>x</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2065"><mml:math id="mml-ieqn-2065"><mml:mi>g</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-2066"><mml:math id="mml-ieqn-2066"><mml:mi>h</mml:mi></mml:math></inline-formula> are constants. Three variational forms of the strong form Eq. (<xref ref-type="disp-formula" rid="eqn-386">386</xref>) are:</p>
<p><disp-formula id="eqn-387"><label>(387)</label><mml:math id="mml-eqn-387" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>:=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:mi>v</mml:mi><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-388"><label>(388)</label><mml:math id="mml-eqn-388" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>:=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:msup><mml:mi>u</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2033;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>x</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-389"><label>(389)</label><mml:math id="mml-eqn-389" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>:=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:msup><mml:mi>u</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>x</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-390"><label>(390)</label><mml:math id="mml-eqn-390" display="block"><mml:mrow><mml:msub><mml:mtext>A</mml:mtext><mml:mn>3</mml:mn></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mstyle displaystyle='true'><mml:mrow><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mi>u</mml:mi></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>v</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>d</mml:mi><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>v</mml:mi><mml:mo>&#x0027;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where the familiar symmetric operator A<sub>2</sub> in Eq. (<xref ref-type="disp-formula" rid="eqn-389">389</xref>) is the weak form, with the non-symmetric operator A<sub>1</sub> in Eq. (<xref ref-type="disp-formula" rid="eqn-388">388</xref>) retaining the second derivative on the solution <inline-formula id="ieqn-2069"><mml:math id="mml-ieqn-2069"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and the non-symmetric operator A<sub>3</sub> in Eq. (<xref ref-type="disp-formula" rid="eqn-390">390</xref>) retaining the second derivative on the test function <inline-formula id="ieqn-2071"><mml:math id="mml-ieqn-2071"><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in addition to the boundary terms. Upon replacing the solution <inline-formula id="ieqn-2072"><mml:math id="mml-ieqn-2072"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> by its neural network (NN) approximation <inline-formula id="ieqn-2073"><mml:math id="mml-ieqn-2073"><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> obtained using a single hidden layer (<inline-formula id="ieqn-2074"><mml:math id="mml-ieqn-2074"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-23">23</xref>) with <inline-formula id="ieqn-2075"><mml:math id="mml-ieqn-2075"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo
stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and layer width <inline-formula id="ieqn-2076"><mml:math id="mml-ieqn-2076"><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>, and using the sine activation function on the interval <inline-formula id="ieqn-2077"><mml:math id="mml-ieqn-2077"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-391"><label>(391)</label><mml:math id="mml-eqn-391" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mrow></mml:munderover><mml:msub><mml:mi>c</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which does not satisfy the essential boundary conditions (whereas the solution <inline-formula id="ieqn-2078"><mml:math id="mml-ieqn-2078"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> does), the loss function for the VPINN method is then the squared residual of the variational form plus the squared residual of the essential boundary conditions:</p>
<p><disp-formula id="eqn-392"><label>(392)</label><mml:math id="mml-eqn-392" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-393"><label>(393)</label><mml:math id="mml-eqn-393" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>}</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>argmin</mml:mtext></mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:munder><mml:mtext>&#xA0;</mml:mtext><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2079"><mml:math id="mml-ieqn-2079"><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> are the optimal parameters for Eq. (<xref ref-type="disp-formula" rid="eqn-391">391</xref>), with the goal of enforcing the variational form and essential boundary conditions in Eq. (<xref ref-type="disp-formula" rid="eqn-387">387</xref>). More details are in [<xref ref-type="bibr" rid="ref-283">283</xref>] [<xref ref-type="bibr" rid="ref-284">284</xref>].</p>
<p>For a symmetric variational form such as <inline-formula id="ieqn-2080"><mml:math id="mml-ieqn-2080"><mml:msub><mml:mtext>A</mml:mtext><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-389">389</xref>), a potential energy exists, and can be minimized to obtain the neural approximate solution <inline-formula id="ieqn-2081"><mml:math id="mml-ieqn-2081"><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-394"><label>(394)</label><mml:math id="mml-eqn-394" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-395"><label>(395)</label><mml:math id="mml-eqn-395" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msup><mml:mo>}</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>argmin</mml:mtext></mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:munder><mml:mtext>&#xA0;</mml:mtext><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mover><mml:mi>J</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo 
stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is similar to the approach taken in [<xref ref-type="bibr" rid="ref-280">280</xref>], where the ReLU activation function (Figure <xref ref-type="fig" rid="fig-24">24</xref>) was used, and where a constraint on the NN parameters was used to satisfy an essential boundary condition.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
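<p>The energy-based loss of Eqs. (<xref ref-type="disp-formula" rid="eqn-394">394</xref>)-(<xref ref-type="disp-formula" rid="eqn-395">395</xref>) can be illustrated in the same setting. The sketch below is an illustration under stated assumptions rather than the method of any cited reference: it assumes the manufactured load f(x) = &#x03C0;<sup>2</sup> sin(&#x03C0;x) with zero end displacements, a single hidden unit with fixed weight and bias, and a coarse scan over the output coefficient in place of a gradient-based optimizer; all function names are hypothetical.</p>
<p><code language="python"><![CDATA[
```python
import math

# 3-point Gauss-Legendre rule on the reference interval (-1, 1).
GL_NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
GL_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)


def integrate(func, a=-1.0, b=1.0, n_sub=40):
    """Composite 3-point Gauss-Legendre quadrature of func over (a, b)."""
    width = (b - a) / n_sub
    total = 0.0
    for j in range(n_sub):
        mid = a + (j + 0.5) * width
        for xi, wq in zip(GL_NODES, GL_WEIGHTS):
            total += 0.5 * width * wq * func(mid + 0.5 * width * xi)
    return total


def energy_loss(theta, f, g, h):
    """Eq. (395): potential energy of u_NN plus squared boundary residuals."""
    def u(x):
        return sum(c * math.sin(w * x + b) for (w, b, c) in theta)

    def du(x):
        return sum(c * w * math.cos(w * x + b) for (w, b, c) in theta)

    a2_uu = integrate(lambda x: du(x) ** 2)    # A_2(u_NN, u_NN)
    b_fu = integrate(lambda x: f(x) * u(x))    # B(f, u_NN)
    energy = 0.5 * a2_uu - b_fu                # J~(theta), Eq. (394)
    boundary = (u(-1.0) - g) ** 2 + (u(1.0) - h) ** 2
    return energy + boundary


def f_load(x):
    """Manufactured load (an assumption): -u'' = f has solution sin(pi*x)."""
    return math.pi ** 2 * math.sin(math.pi * x)


# Scan the output coefficient c of a single unit c * sin(pi * x).
scan = []
for c in (0.5, 0.75, 1.0, 1.25, 1.5):
    theta = [(math.pi, 0.0, c)]
    scan.append((energy_loss(theta, f_load, g=0.0, h=0.0), c))

best_loss, best_c = min(scan)
print(best_c, best_loss)
```
]]></code></p>
<p>For this one-parameter family, the energy is (c<sup>2</sup>/2 &#x2212; c)&#x03C0;<sup>2</sup>, which is smallest at c = 1 with value &#x2212;&#x03C0;<sup>2</sup>/2, so the scan selects the exact solution. Unlike the squared residual of Eq. (<xref ref-type="disp-formula" rid="eqn-392">392</xref>), the energy is not bounded below by zero, and only first derivatives of the network are needed.</p>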
<statement id="st9_4"><title>Remark 9.4.</title>
<p><italic>PINN, kernel machines, training, convergence problems</italic>. There is a relationship between PINN and the kernel machines in Section <xref ref-type="sec" rid="s8">8</xref>. Specifically, the neural tangent kernel [<xref ref-type="bibr" rid="ref-232">232</xref>], which &#x201C;captures the behavior of fully-connected neural networks in the infinite width limit during training via gradient descent,&#x201D; was used in [<xref ref-type="bibr" rid="ref-286">286</xref>] to understand when and why PINN failed to train. The authors of [<xref ref-type="bibr" rid="ref-286">286</xref>] found a &#x201C;remarkable discrepancy in the convergence rate of the different loss components contributing to the total training error,&#x201D; and proposed a new gradient descent algorithm to fix the problem.</p>
<p>It was often reported that PINN optimization converged to &#x201C;solutions that lacked physical behaviors,&#x201D; and &#x201C;reduced-domain methods improved convergence behavior of PINNs&#x201D;; see [<xref ref-type="bibr" rid="ref-287">287</xref>], where a dynamical system of the form below was studied:</p>
<p><disp-formula id="eqn-396"><label>(396)</label><mml:math id="mml-eqn-396" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mo 
stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-2082"><mml:math id="mml-ieqn-2082"><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> being called the &#x201C;physics&#x201D; loss function. An example with <inline-formula id="ieqn-2083"><mml:math id="mml-ieqn-2083"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>u</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> was studied to show that the &#x201C;physics loss optimization predominantly results in convergence issues, leading to incorrectly learned system dynamics&#x201D;; see [<xref ref-type="bibr" rid="ref-287">287</xref>], where it was found that &#x201C;solutions corresponding to nonphysical system dynamics [could] be dominant in the physics loss landscape and optimization,&#x201D; and that &#x201C;reducing the computational domain [lowered] the optimization complexity and chance of getting trapped with nonphysical solutions.&#x201D;</p>
<p>See also [<xref ref-type="bibr" rid="ref-288">288</xref>] for incorporating the Lyapunov stability concept into PINN formulation for CFD to &#x201C;improve the generalization error and reduce the prediction uncertainty.&#x201D;&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement> 
<statement id="st9_5"><title>Remark 9.5.</title>
<p><italic>PINN and attention architecture</italic>. In [<xref ref-type="bibr" rid="ref-228">228</xref>], the Physics-Informed Attention-based Neural Network (PIANN) was proposed to connect PINN to the attention architecture, discussed in Section <xref ref-type="sec" rid="s7_4_3">7.4.3</xref>, to solve hyperbolic PDEs with shock waves. See Remark <xref ref-type="statement" rid="st7_7">7.7</xref> and Remark <xref ref-type="statement" rid="st11_11">11.11</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<statement id="st9_6"><title>Remark 9.6.</title>
<p><italic>&#x201C;Physics-Informed Learning Machine&#x201D; (PILM) 2021 US Patent</italic> [<xref ref-type="bibr" rid="ref-289">289</xref>]. First note that the patent title used the phrase &#x201C;learning machine,&#x201D; instead of &#x201C;machine learning,&#x201D; indicating that the emphasis of the patent appeared to be on &#x201C;machine,&#x201D; instead of on &#x201C;learning&#x201D; [<xref ref-type="bibr" rid="ref-289">289</xref>]. PINN was not mentioned, as it was first invented in [<xref ref-type="bibr" rid="ref-290">290</xref>] [<xref ref-type="bibr" rid="ref-291">291</xref>], which were cited by the patent authors in their original PINN paper [<xref ref-type="bibr" rid="ref-282">282</xref>].<xref ref-type="fn" rid="fn255"><sup>255</sup></xref><fn id="fn255"><label>255</label><p>The 2019 paper [<xref ref-type="bibr" rid="ref-282">282</xref>] was a merger of a two-part preprint [<xref ref-type="bibr" rid="ref-292">292</xref>] [<xref ref-type="bibr" rid="ref-293">293</xref>].</p></fn> The abstract of this 2021 PILM US Patent [<xref ref-type="bibr" rid="ref-289">289</xref>] reads as follows:</p>
<p><disp-quote><p>&#x201C;A method for analyzing an object includes modeling the object with a differential equation, such as a linear partial differential equation (PDE), and sampling data associated with the differential equation. The method uses a probability distribution device to obtain the solution to the differential equation. The method eliminates use of discretization of the differential equation.&#x201D;</p></disp-quote></p>
<p>The first sentence is nothing new to the readers. In the second sentence, a &#x201C;probability distribution device&#x201D; could be replaced by a neural network, which would make PILM into PINN. This patent mainly focused on Gaussian processes (Section <xref ref-type="sec" rid="s8_3">8.3</xref>), as an example of a probability distribution (see Figure <xref ref-type="fig" rid="fig-4">4</xref> in [<xref ref-type="bibr" rid="ref-289">289</xref>]). The third sentence would be the claim-to-fame of PILM, and also of PINN.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<statement id="st9_7"><title>Remark 9.7.</title>
<p><italic>Using PINN frameworks</italic>. While undergraduates with limited knowledge of the theory of the Finite Element Method could run FE analyses of complicated structures and complex domain geometries on a laptop using commercial FE codes, solving problems with exceedingly simple domain geometry using a PINN framework such as DeepXDE does require knowledge of governing PDEs, initial and boundary conditions, artificial neural networks and frameworks (such as PyTorch, TensorFlow, etc.), the Python language, and a more powerful computer. In addition, because there are many parameters to tune, first-time users could encounter disappointment and doubt when trying to solve a new problem beyond the sample problems posted on the DeepXDE website. It is not clear when the PINN methods would reach the level of commercial FE codes that undergraduates could use, or whether they would just fade away after an initial period of excitement, like the meshless methods before them.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
</sec>
</sec>
<sec id="s10"><label>10</label>
<title>Application 1: Enhanced numerical quadrature for finite elements</title>
<p>The results and deep-learning concepts used in [<xref ref-type="bibr" rid="ref-38">38</xref>] were presented in Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref> above. In this section, we discuss some details of the formulation.</p>
<p>The finite element method (FEM) has become the most important numerical method for the approximation of solutions to partial differential equations, in particular, the governing equations in solid mechanics. As for any mesh-based method, the discretization of a continuous (spatial, temporal) domain into a finite-element mesh, i.e., a disjoint set of finite elements, is a vital ingredient that affects the quality of the results and therefore has emerged as a field of research on its own. Being based on the weak formulation of the governing balance equations, numerical integration is a second key ingredient of FEM, in which integrals over the physical domain of interest are approximated by the sum of integrals over the individual finite elements. In real-world problems, regularly shaped elements, e.g., triangles and rectangles in 2-D, tetrahedra and hexahedra in 3-D, typically no longer suffice to represent the complex shape of bodies or physical domains. By distorting basic element shapes, finite elements of more arbitrary shapes are obtained, while the interpolation functions of the &#x201C;parent&#x201D; elements can be retained.<xref ref-type="fn" rid="fn256"><sup>256</sup></xref><fn id="fn256"><label>256</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-39">39</xref>], p.61, Section &#x201C;3.5 Isoparametric form&#x201D; and p.170, Section &#x201C;6.5 Mapping: Parametric forms.&#x201D;</p></fn> The mapping represents a coordinate transformation by which the coordinates of a &#x201C;parent&#x201D; or &#x201C;reference&#x201D; element are mapped onto distorted, possibly curvilinear, physical coordinates of the actual elements in the mesh. Conventional finite element formulations use polynomials as interpolation functions, for which Gauss-Legendre quadrature is the most efficient way to integrate numerically. 
Efficiency in numerical quadrature is directly related to the (finite) number of integration points required to exactly integrate a polynomial of a given degree.<xref ref-type="fn" rid="fn257"><sup>257</sup></xref><fn id="fn257"><label>257</label><p>Using Gauss-Legendre quadrature, <inline-formula id="ieqn-3305"><mml:math id="mml-ieqn-3305"><mml:mi>p</mml:mi></mml:math></inline-formula> integration points integrate polynomials up to a degree of <inline-formula id="ieqn-3306"><mml:math id="mml-ieqn-3306"><mml:mn>2</mml:mn><mml:mi>p</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> exactly.</p></fn> For distorted elements, the Jacobian of the transformation describing the distortion of the parent element renders the integrand of, e.g., the element stiffness matrix non-polynomial. Therefore, Gauss-Legendre quadrature generally does not integrate these integrals exactly; the accuracy depends, roughly speaking, on the degree of distortion.</p>
<sec id="s10_1"><label>10.1</label>
<title>Two methods of quadrature, 1-D example</title>
<p>To motivate their approaches, the authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] presented an illustrative 1-D example of a simple integral, which was analytically integrated as</p>
<p><disp-formula id="eqn-397"><label>(397)</label><mml:math id="mml-eqn-397" display="block"><mml:mrow><mml:mstyle displaystyle='true'><mml:mrow><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>1</mml:mn></mml:msubsup><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mstyle><mml:mi>d</mml:mi><mml:mi>x</mml:mi><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msubsup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:mfrac><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>1</mml:mn></mml:msubsup><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mfrac><mml:mn>2</mml:mn><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Using 2 integration points, Gauss-Legendre quadrature yields a significant error</p>
<p><disp-formula id="eqn-398"><label>(398)</label><mml:math id="mml-eqn-398" display="block"><mml:mrow><mml:mstyle displaystyle='true'><mml:mrow><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>1</mml:mn></mml:msubsup><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mstyle><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msqrt><mml:mn>3</mml:mn></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msqrt><mml:mn>3</mml:mn></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>2</mml:mn><mml:mn>243</mml:mn></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>which is due to the insufficient number of integration points.</p>
<p>Method 1 to improve accuracy, which is reflected in Application 1.1 (see Section <xref ref-type="sec" rid="s10_2">10.2</xref>), is to increase the number of integration points. In the above example, 6 integration points are required to obtain the exact value of the integral. By increasing the accuracy, however, we sacrifice computational efficiency due to the need for 6 evaluations of the integrand instead of the original 2 evaluations.</p>
<p>Method 2 is to retain 2 integration points, and to adjust the quadrature weights at the integration points instead. If both quadrature weights are set to <inline-formula id="ieqn-2084"><mml:math id="mml-ieqn-2084"><mml:mn>243</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>11</mml:mn></mml:math></inline-formula> instead of 1, the integral evaluates to the exact result:</p>
<p><disp-formula id="eqn-399"><label>(399)</label><mml:math id="mml-eqn-399" display="block"><mml:mrow><mml:mstyle displaystyle='true'><mml:mrow><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>1</mml:mn></mml:msubsup><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mstyle><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>243</mml:mn><mml:mn>11</mml:mn></mml:mfrac><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msqrt><mml:mn>3</mml:mn></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mfrac><mml:mn>243</mml:mn><mml:mn>11</mml:mn></mml:mfrac><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msqrt><mml:mn>3</mml:mn></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>2</mml:mn><mml:mn>11</mml:mn></mml:mfrac><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>By adjusting the quadrature weights rather than the number of integration points, which is the key concept of Application 1.2 (see Section <xref ref-type="sec" rid="s10_3">10.3</xref>), the computational efficiency of the original approach Eq. (<xref ref-type="disp-formula" rid="eqn-398">398</xref>) is retained.</p>
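<p>The 1-D example above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name <monospace>gauss_legendre</monospace> and the variable names are hypothetical and are not taken from [<xref ref-type="bibr" rid="ref-38">38</xref>]. It evaluates the standard 2-point rule, the 6-point rule of Method 1, and the 2-point rule with both weights rescaled as in Method 2.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def gauss_legendre(f, n_points):
    """Integrate f over [-1, 1] with an n_points Gauss-Legendre rule."""
    points, weights = np.polynomial.legendre.leggauss(n_points)
    return float(np.sum(weights * f(points)))


def integrand(x):
    return x**10


exact = 2.0 / 11.0

# Standard 2-point rule: points at -1/sqrt(3) and +1/sqrt(3), weights equal to 1.
two_point = gauss_legendre(integrand, 2)

# Method 1: increase the number of points; 6 points integrate degree <= 11 exactly.
six_point = gauss_legendre(integrand, 6)

# Method 2: keep the 2 Gauss points, but set both weights to 243/11.
points, _ = np.polynomial.legendre.leggauss(2)
adjusted = float(np.sum((243.0 / 11.0) * integrand(points)))

print(two_point, six_point, adjusted, exact)
```
]]></code>
<p>Note that the adjusted weights are specific to this integrand; for a different integrand, different weights would be needed, which is what motivates predicting them per element.</p>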
<p>In this study [<xref ref-type="bibr" rid="ref-38">38</xref>], hexahedral elements with linear shape functions were considered. To exactly integrate the element stiffness matrix of an undistorted element, <inline-formula id="ieqn-2086"><mml:math id="mml-ieqn-2086"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mo>=</mml:mo><mml:mn>8</mml:mn></mml:math></inline-formula> integration points are required.<xref ref-type="fn" rid="fn258"><sup>258</sup></xref><fn id="fn258"><label>258</label><p>Conventional linear hexahedra are known to suffer from locking, see, e.g., [<xref ref-type="bibr" rid="ref-39">39</xref>] Section &#x201C;10.3.2 Locking,&#x201D; which can be alleviated by &#x201C;reduced integration,&#x201D; i.e., using a single integration point (<inline-formula id="ieqn-3307"><mml:math id="mml-ieqn-3307"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>). The concept in [<xref ref-type="bibr" rid="ref-38">38</xref>], however, immediately translates to higher-order shape functions and non-conventional finite element formulations (e.g., mixed formulation).</p></fn> Both applications summarized below required the integrand, i.e., the element shape, to be identified in a unique way. Gauss-Legendre quadrature is performed in the local coordinates of the reference element with accuracy invariant to both rigid-body motion and uniform stretching of the actual elements. To account for these invariances, a &#x201C;normalization&#x201D; of elements was proposed in [<xref ref-type="bibr" rid="ref-38">38</xref>] (Figure <xref ref-type="fig" rid="fig-96">96</xref>), i.e., hexahedra were re-located to the origin of a frame of reference, re-oriented along the coordinate planes, and scaled by the average length of two of their edges.</p>
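<p>The normalization just described (one translation, three rotations, and a uniform scaling; see the caption of Figure <xref ref-type="fig" rid="fig-96">96</xref>) could be sketched as follows. This is an illustrative sketch only, not the code of [<xref ref-type="bibr" rid="ref-38">38</xref>]. It assumes NumPy, an 8-by-3 array of nodal coordinates with rows ordered A to H, and rotation angles obtained from <monospace>arctan2</monospace>; all function names are hypothetical.</p>
<code language="python"><![CDATA[
```python
import numpy as np

A, B, D = 0, 1, 3  # assumed row indices of nodes A, B, D (rows ordered A..H)


def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def normalize_hexahedron(nodes):
    """Return the normalized nodal coordinates of a linear hexahedron."""
    x = np.asarray(nodes, dtype=float)
    x = x - x[A]                                       # (1) node A to the origin
    x = x @ rot_z(-np.arctan2(x[B, 1], x[B, 0])).T     # (2) node B into the xz-plane
    x = x @ rot_y(np.arctan2(x[B, 2], x[B, 0])).T      # (3) node B onto the x-axis
    x = x @ rot_x(-np.arctan2(x[D, 2], x[D, 1])).T     # (4) node D into the xy-plane
    l0 = 0.5 * (np.linalg.norm(x[B]) + np.linalg.norm(x[D]))  # l0 = (AB + AD) / 2
    return x / l0                                      # (5) scale by 1 / l0


# Usage: a unit cube that was scaled, rotated and translated is mapped back.
cube = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
                 [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=float)
R = rot_z(0.7) @ rot_y(-0.4) @ rot_x(1.1)
moved = 2.5 * (cube @ R.T) + np.array([3.0, -1.0, 2.0])
normalized = normalize_hexahedron(moved)
```
]]></code>
<p>After normalization, node A has three fixed coordinates, node B two, and node D one, which leaves 18 non-trivial nodal coordinates to describe the element shape.</p>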
<p>To train the neural networks involved in their approaches, a large set of distorted elements was created by randomly displacing seven nodes of a regular cube [<xref ref-type="bibr" rid="ref-38">38</xref>],</p>
<p><disp-formula id="eqn-400"><label>(400)</label><mml:math id="mml-eqn-400" display="block"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mi>B</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;&#x2009;&#x2009;&#x2009;</mml:mtext><mml:mi>C</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;&#x2009;&#x2009;&#x2009;</mml:mtext><mml:mi>D</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>5</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>6</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;&#x2009;&#x2009;&#x2009;</mml:mtext><mml:mi>E</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>7</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>8</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>9</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x0009;</mml:mtext></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>F</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:msub><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;&#x2009;&#x2009;&#x2009;</mml:mtext><mml:mi>G</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>13</mml:mn></mml:mrow></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>14</mml:mn></mml:mrow></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>15</mml:mn></mml:mrow></mml:msub><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;&#x2009;&#x2009;&#x2009;</mml:mtext><mml:mi>H</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>16</mml:mn></mml:mrow></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>17</mml:mn></mml:mrow></mml:msub><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mn>1</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>18</mml:mn></mml:mrow></mml:msub><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x0009;</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2087"><mml:math id="mml-ieqn-2087"><mml:msub><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-2088"><mml:math id="mml-ieqn-2088"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>18</mml:mn></mml:math></inline-formula> were 18 random numbers, see Figure <xref ref-type="fig" rid="fig-97">97</xref>. The elements were collected into five groups according to five different degrees of maximum distortion (maximum <italic>possible</italic> nodal displacement) <inline-formula id="ieqn-2089"><mml:math id="mml-ieqn-2089"><mml:mi>d</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>0.1</mml:mn><mml:mo>,</mml:mo><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mn>0.3</mml:mn><mml:mo>,</mml:mo><mml:mn>0.4</mml:mn><mml:mo>,</mml:mo><mml:mn>0.5</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. Elements in the set <inline-formula id="ieqn-2090"><mml:math id="mml-ieqn-2090"><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula> would be highly distorted only for <inline-formula id="ieqn-2091"><mml:math id="mml-ieqn-2091"><mml:msub><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> having values closer to 1, and may be only slightly distorted for <inline-formula id="ieqn-2092"><mml:math id="mml-ieqn-2092"><mml:msub><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> having values closer to 0.<xref ref-type="fn" rid="fn259"><sup>259</sup></xref><fn id="fn259"><label>259</label><p>In fact, a somewhat ambiguous description of the random generation of elements was provided in [<xref ref-type="bibr" rid="ref-38">38</xref>]. 
On the one hand, the authors stated that <italic>&#x201C;&#x2026;coordinates of nodes are changed using a uniform random number <inline-formula id="ieqn-3308"><mml:math id="mml-ieqn-3308"><mml:mi>r</mml:mi></mml:math></inline-formula> &#x2026;,&#x201D;</italic> and did not distinguish the random numbers in Eq. (<xref ref-type="disp-formula" rid="eqn-400">400</xref>) by subscripts. On the other hand, they noted that exaggerated distortion may occur if nodal coordinates of an element were changed independently, and introduced the constraints on the distortion mentioned above in that context. If the same random number <inline-formula id="ieqn-3309"><mml:math id="mml-ieqn-3309"><mml:mi>r</mml:mi></mml:math></inline-formula> were used for all nodal coordinates, all elements generated would exhibit the same mode of distortion.</p></fn> To avoid large distortion, the displacement of each node was restricted to the range of 0.5 to 2, and the angle between adjacent faces must lie within the range of 90&#x00B0; to 120&#x00B0;. After applying the normalization procedure, an element was characterized by a total of 18 nodal coordinates randomly distributed, but in a specific manner according to Eq. (<xref ref-type="disp-formula" rid="eqn-400">400</xref>).</p>
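<p>The random generation of distorted elements according to Eq. (<xref ref-type="disp-formula" rid="eqn-400">400</xref>) could be sketched as below. Several points are assumptions rather than facts from [<xref ref-type="bibr" rid="ref-38">38</xref>]: the 18 random numbers are drawn independently and uniformly from [0, 1], the sign of each displacement is chosen at random with equal probability, and the additional constraints on distortion mentioned above are not enforced. The names <monospace>CUBE</monospace>, <monospace>FREE</monospace> and <monospace>random_element</monospace> are hypothetical.</p>
<code language="python"><![CDATA[
```python
import numpy as np

# Regular unit cube, rows ordered A..H, consistent with Eq. (400) for d = 0.
CUBE = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
                 [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=float)

# Coordinates that are perturbed in Eq. (400): 1 (B) + 3 (C) + 2 (D) + 4 * 3 (E..H) = 18.
FREE = np.array([[0, 0, 0],   # A: fixed at the origin
                 [1, 0, 0],   # B: moves along the x-axis only
                 [1, 1, 1],   # C
                 [1, 1, 0],   # D: stays in the xy-plane
                 [1, 1, 1],   # E
                 [1, 1, 1],   # F
                 [1, 1, 1],   # G
                 [1, 1, 1]],  # H
                dtype=bool)


def random_element(d, rng):
    """Return 8 x 3 nodal coordinates of a cube distorted by at most d per coordinate."""
    r = rng.uniform(0.0, 1.0, size=18)        # assumed independent r_i in [0, 1]
    sign = rng.choice([-1.0, 1.0], size=18)   # assumed random choice of the +/- sign
    nodes = CUBE.copy()
    nodes[FREE] += sign * r * d
    return nodes


rng = np.random.default_rng(0)
element = random_element(0.3, rng)
```
]]></code>
<p>With independent random numbers, elements in the same group can differ widely in their actual distortion, since only the maximum possible displacement is prescribed.</p>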
<fig id="fig-96">
<label>Figure 96</label>
<caption><title><italic>Normalization procedure for hexahedra</italic> (Section <xref ref-type="sec" rid="s10_1">10.1</xref>). Numerical integration is performed in local element coordinates. The accuracy of Gauss-Legendre quadrature only depends on the element shape, i.e., it is invariant with respect to rigid-body motion and uniform stretching deformation. For this reason, a normalization procedure was introduced involving one translation and three rotations (second from left to right) for linear hexahedral elements, whose nodes are labelled as shown for the regular hexahedron (left) [<xref ref-type="bibr" rid="ref-38">38</xref>]. (1) The element is displaced such that node <italic>A</italic> coincides with the origin <italic>O</italic> of the global frame. (2) The hexahedron is rotated about the <italic>z</italic>-axis of the global frame to place node B in the <italic>xz</italic>-plane, and then (3) rotated about the <italic>y</italic>-axis such that node <italic>B</italic> lies on the <italic>x</italic>-axis. (4) A third rotation about the <italic>x</italic>-axis relocates node <italic>D</italic> to the <italic>xy</italic>-plane of the global frame. (5) Finally (not shown), the element is scaled by a factor of <inline-formula id="ieqn-766"><mml:math id="mml-ieqn-766"><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>l</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>, where <inline-formula id="ieqn-767"><mml:math id="mml-ieqn-767"><mml:msub><mml:mi>l</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mi>B</mml:mi><mml:mo>+</mml:mo><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:math></inline-formula>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-96.tif"/>
</fig>
<p>To quantify the quadrature error, the authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] introduced <inline-formula id="ieqn-2093"><mml:math id="mml-ieqn-2093"><mml:mi>e</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>q</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, a relative measure of accuracy for the components of the stiffness matrix <inline-formula id="ieqn-2094"><mml:math id="mml-ieqn-2094"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as a function of the number of integration points <inline-formula id="ieqn-2095"><mml:math id="mml-ieqn-2095"><mml:mi>q</mml:mi></mml:math></inline-formula> along each local coordinate,<xref ref-type="fn" rid="fn260"><sup>260</sup></xref><fn id="fn260"><label>260</label><p>For example, for a <inline-formula id="ieqn-3310"><mml:math id="mml-ieqn-3310"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> integration with a total of <inline-formula id="ieqn-3311"><mml:math id="mml-ieqn-3311"><mml:mn>8</mml:mn></mml:math></inline-formula> integration points, <inline-formula id="ieqn-3312"><mml:math id="mml-ieqn-3312"><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>.</p></fn> defined as</p>
<p><disp-formula id="eqn-401"><label>(401)</label><mml:math id="mml-eqn-401" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>e</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>q</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow></mml:mfrac><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2096"><mml:math id="mml-ieqn-2096"><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the component in the <inline-formula id="ieqn-2097"><mml:math id="mml-ieqn-2097"><mml:mi>i</mml:mi></mml:math></inline-formula>-th row and <inline-formula id="ieqn-2098"><mml:math id="mml-ieqn-2098"><mml:mi>j</mml:mi></mml:math></inline-formula>-th column of the stiffness matrix obtained with <inline-formula id="ieqn-2099"><mml:math id="mml-ieqn-2099"><mml:mi>q</mml:mi></mml:math></inline-formula> integration points. The error <inline-formula id="ieqn-2100"><mml:math id="mml-ieqn-2100"><mml:mi>e</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>q</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is measured with respect to reference values <inline-formula id="ieqn-2101"><mml:math id="mml-ieqn-2101"><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula>, which are obtained using the Gauss-Legendre quadrature with <inline-formula id="ieqn-2102"><mml:math id="mml-ieqn-2102"><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>30</mml:mn></mml:math></inline-formula> integration points, i.e., a total of <inline-formula id="ieqn-2103"><mml:math 
id="mml-ieqn-2103"><mml:mrow><mml:msup><mml:mrow><mml:mn>30</mml:mn></mml:mrow><mml:mn>3</mml:mn></mml:msup><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mn>27</mml:mn><mml:mo>,</mml:mo><mml:mn>000</mml:mn></mml:mrow></mml:math></inline-formula> integration points for 3-D elements.</p>
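<p>As a small illustration of the error measure in Eq. (<xref ref-type="disp-formula" rid="eqn-401">401</xref>), the sketch below computes it for two given matrices. It assumes NumPy; the function name <monospace>quadrature_error</monospace> and the 2-by-2 matrices are made up for illustration and are not stiffness matrices from [<xref ref-type="bibr" rid="ref-38">38</xref>].</p>
<code language="python"><![CDATA[
```python
import numpy as np


def quadrature_error(k_q, k_ref):
    """Eq. (401): sum of absolute component errors over the largest reference magnitude."""
    k_q = np.asarray(k_q, dtype=float)
    k_ref = np.asarray(k_ref, dtype=float)
    return float(np.sum(np.abs(k_q - k_ref)) / np.max(np.abs(k_ref)))


# Made-up matrices: k_ref plays the role of the reference obtained with q_max points.
k_ref = np.array([[2.0, -1.0], [-1.0, 2.0]])
k_q = np.array([[2.01, -1.0], [-1.0, 1.98]])
error = quadrature_error(k_q, k_ref)
```
]]></code>
<p>Because the absolute errors are summed over all components but normalized by a single component, the measure grows with the size of the matrix as well as with the inaccuracy of the quadrature.</p>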
</sec>
<sec id="s10_2"><label>10.2</label>
<title>Application 1.1: Method 1, Optimal number of integration points</title>
<p>The details of this particular deep-learning application, mentioned briefly in Section <xref ref-type="sec" rid="s2_3">2.3</xref> on motivation via applications of deep learning (specifically Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>, item (1)), are provided here. The idea is to have a neural network predict, for each element (particularly distorted elements), the number of integration points that provides accurate integration within a given error tolerance <inline-formula id="ieqn-2104"><mml:math id="mml-ieqn-2104"><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-38">38</xref>].</p>
<sec id="s10_2_1"><label>10.2.1</label>
<title>Method 1, feasibility study</title>
<p>In the example in [<xref ref-type="bibr" rid="ref-38">38</xref>], the quadrature error is required to be smaller than <inline-formula id="ieqn-2105"><mml:math id="mml-ieqn-2105"><mml:mrow> <mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mtext>tol</mml:mtext></mml:mrow></mml:msup><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mn>1</mml:mn><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mo>&#x00D7;</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:msup><mml:mrow><mml:mn>10</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. For this purpose, a fully-connected <xref ref-type="sec" rid="s4">feed-forward neural network</xref> with <inline-formula id="ieqn-2106"><mml:math id="mml-ieqn-2106"><mml:mi>N</mml:mi></mml:math></inline-formula> hidden layers of 50 neurons each was used. The non-trivial nodal coordinates were fed as inputs to the network, i.e., <inline-formula id="ieqn-2108"><mml:math id="mml-ieqn-2108"><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>18</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. 
This neural network performed a classification task,<xref ref-type="fn" rid="fn261"><sup>261</sup></xref><fn id="fn261"><label>261</label><p>The authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] used the squared-error loss function (Section <xref ref-type="sec" rid="s5_1_1">5.1.1</xref>) for the classification task, for which the softmax loss function can also be used, as discussed in Section <xref ref-type="sec" rid="s5_1_3">5.1.3</xref>.</p></fn> where each class corresponded to the minimum number of integration points <inline-formula id="ieqn-2109"><mml:math id="mml-ieqn-2109"><mml:mi>q</mml:mi></mml:math></inline-formula> along a local coordinate axis for a maximum error <inline-formula id="ieqn-2110"><mml:math id="mml-ieqn-2110"><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. Figure <xref ref-type="fig" rid="fig-98">98</xref> presents the distribution of 10,000 elements generated randomly for two degrees of maximum possible distortion, <inline-formula id="ieqn-2111"><mml:math id="mml-ieqn-2111"><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-2112"><mml:math id="mml-ieqn-2112"><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>, using the method of Figure <xref ref-type="fig" rid="fig-97">97</xref>, and classified by the minimum number of integration points required. 
Similar results for <inline-formula id="ieqn-2113"><mml:math id="mml-ieqn-2113"><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mn>0.3</mml:mn></mml:math></inline-formula> were presented in [<xref ref-type="bibr" rid="ref-38">38</xref>]; the conclusion that &#x201C;as the shape is distorted, more integration points are needed&#x201D; is as expected.</p>
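<p>The structure of such a classifier, with 18 inputs and a chosen number of hidden layers of 50 neurons each, could be sketched in plain NumPy as follows. Only the input size and the hidden-layer width come from the description above. The ReLU activation, the random initialization, the number of classes, and the mapping from the class index to the number of integration points are assumptions made for illustration, and the network shown here is untrained.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def init_network(n_hidden_layers, n_classes, rng, n_inputs=18, width=50):
    """Random weights and zero biases for a fully-connected feed-forward network."""
    sizes = [n_inputs] + [width] * n_hidden_layers + [n_classes]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(n, m)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    """Propagate one input vector; ReLU in hidden layers (assumed), linear output."""
    y = np.asarray(x, dtype=float)
    for W, b in params[:-1]:
        y = np.maximum(W @ y + b, 0.0)
    W, b = params[-1]
    return W @ y + b


def predict_q(params, x, q_min=1):
    """Map the largest output to a number of integration points (hypothetical mapping)."""
    return q_min + int(np.argmax(forward(params, x)))


rng = np.random.default_rng(0)
params = init_network(n_hidden_layers=3, n_classes=5, rng=rng)
x = rng.uniform(0.0, 1.0, size=18)   # stand-in for the 18 non-trivial nodal coordinates
scores = forward(params, x)
q_pred = predict_q(params, x)
```
]]></code>
<p>In practice the weights would be obtained by training on elements labelled with their minimum number of integration points, as described in the training phase.</p>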
<fig id="fig-97">
<label>Figure 97</label>
<caption><title><italic>Creation of randomly distorted elements</italic> (Section <xref ref-type="sec" rid="s10">10</xref>). Hexahedra forming the training and validation sets are created by randomly displacing the nodes of a regular hexahedron. To comply with the normalization procedure, node A remains fixed, node B is shifted along the <inline-formula id="ieqn-1940"><mml:math id="mml-ieqn-1940"><mml:mi>x</mml:mi></mml:math></inline-formula>-axis and node C is displaced within the <inline-formula id="ieqn-1941"><mml:math id="mml-ieqn-1941"><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:math></inline-formula>-plane. For each of the remaining nodes (E, F, G, H), all three nodal coordinates are varied randomly. The elements are grouped according to the maximum <italic>possible</italic> nodal displacement <inline-formula id="ieqn-1942"><mml:math id="mml-ieqn-1942"><mml:mi>d</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>0.1</mml:mn><mml:mo>,</mml:mo><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mn>0.3</mml:mn><mml:mo>,</mml:mo><mml:mn>0.4</mml:mn><mml:mo>,</mml:mo><mml:mn>0.5</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, from which each of the 18 nodal displacements was obtained upon multiplication with a random number <inline-formula id="ieqn-1943"><mml:math id="mml-ieqn-1943"><mml:msub><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-1944"><mml:math id="mml-ieqn-1944"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>18</mml:mn></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-38">38</xref>], see Eq. (<xref ref-type="disp-formula" rid="eqn-400">400</xref>). 
Each of the 8 nodal zones in which the corresponding node can be placed randomly is shown in red; all nodal zones are cubes, except for Node B (an interval on the <inline-formula id="ieqn-1945"><mml:math id="mml-ieqn-1945"><mml:mi>x</mml:mi></mml:math></inline-formula> axis) and for Node C (a square in the plane <inline-formula id="ieqn-1946"><mml:math id="mml-ieqn-1946"><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-97.tif"/>
</fig>
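<p>The element-generation procedure described in the caption of Figure 97 can be sketched as follows. This is only a minimal illustration under stated assumptions: the unit-cube base element, the node ordering A to H, the inclusion of node D among the freely displaced nodes (so that the number of random displacements is 1 + 2 + 5 &#x00D7; 3 = 18), and the one-sided displacement <italic>d</italic> times a random number in [0, 1] are our reading of the caption, not details confirmed by [38].</p>
<code language="python"><![CDATA[
```python
import random

# Corner nodes of a unit cube; the labels and their ordering are assumptions.
REGULAR_HEX = {
    "A": (0.0, 0.0, 0.0),
    "B": (1.0, 0.0, 0.0),
    "C": (1.0, 1.0, 0.0),
    "D": (0.0, 1.0, 0.0),
    "E": (0.0, 0.0, 1.0),
    "F": (1.0, 0.0, 1.0),
    "G": (1.0, 1.0, 1.0),
    "H": (0.0, 1.0, 1.0),
}

# Number of leading coordinates (x, then y, then z) that may move per node:
# A is fixed, B moves along x only, C moves in the xy-plane, the rest move freely.
FREE_AXES = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 3, "F": 3, "G": 3, "H": 3}


def distort_hexahedron(d, rng):
    """Return a distorted copy of REGULAR_HEX.

    Each free coordinate is shifted by d * r, with r drawn uniformly from [0, 1).
    """
    distorted = {}
    for label, coords in REGULAR_HEX.items():
        moved = list(coords)
        for axis in range(FREE_AXES[label]):
            moved[axis] += d * rng.random()
        distorted[label] = tuple(moved)
    return distorted


# One element from the group with maximum possible displacement d = 0.5.
element = distort_hexahedron(0.5, random.Random(0))
print(element["B"], element["C"])
```
]]></code>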
</sec>
<sec id="s10_2_2"><label>10.2.2</label>
<title>Method 1, training phase</title>
<p>To train the network, 2000 randomly distorted elements were generated for each of the five degrees of maximum distortion, <inline-formula id="ieqn-2114"><mml:math id="mml-ieqn-2114"><mml:mi>d</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>0.1</mml:mn><mml:mo>,</mml:mo><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mn>0.3</mml:mn><mml:mo>,</mml:mo><mml:mn>0.4</mml:mn><mml:mo>,</mml:mo><mml:mn>0.5</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, to obtain a total of 10000 elements [<xref ref-type="bibr" rid="ref-38">38</xref>]. Before the network was trained, the minimum number of integration points <inline-formula id="ieqn-2115"><mml:math id="mml-ieqn-2115"><mml:msup><mml:mi>q</mml:mi><mml:mrow><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">p</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> to meet the accuracy tolerance <inline-formula id="ieqn-2116"><mml:math id="mml-ieqn-2116"><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> was determined for each element.</p>
<p>The whole dataset was partitioned<xref ref-type="fn" rid="fn262"><sup>262</sup></xref><fn id="fn262"><label>262</label><p>There was no indication in [<xref ref-type="bibr" rid="ref-38">38</xref>] on how these 5,000 elements were selected from the total of 10,000 elements, perhaps randomly.</p></fn> into a training set <inline-formula id="ieqn-2117"><mml:math id="mml-ieqn-2117"><mml:mrow><mml:mi mathvariant="double-struck">X</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>M</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-2118"><mml:math id="mml-ieqn-2118"><mml:mrow><mml:mi mathvariant="double-struck">Y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>M</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> of <inline-formula id="ieqn-2119"><mml:math id="mml-ieqn-2119"><mml:mtext>M</mml:mtext><mml:mo>=</mml:mo><mml:mrow><mml:mn>5000</mml:mn></mml:mrow></mml:math></inline-formula> elements and an 
equally large validation set,<xref ref-type="fn" rid="fn263"><sup>263</sup></xref><fn id="fn263"><label>263</label><p>In contrast to the terminology of the present paper, in which &#x201C;training set&#x201D; and &#x201C;validation set&#x201D; are used (Section <xref ref-type="sec" rid="s6_1">6.1</xref>), the terms &#x201C;training patterns&#x201D; and &#x201C;test patterns&#x201D;, respectively, were used in [<xref ref-type="bibr" rid="ref-38">38</xref>]. The &#x201C;test patterns&#x201D; were used in the training process, since the authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] <italic>&#x201C;&#x2026;terminated the training before the error for test patterns started to increase&#x201D;</italic> (p.331). These &#x201C;test patterns&#x201D;, based on their use as stated, correspond to the elements of the <italic>validation set</italic> in the present paper, Figure <xref ref-type="fig" rid="fig-100">100</xref>. Technically, there was no test set in [<xref ref-type="bibr" rid="ref-38">38</xref>]; see Section <xref ref-type="sec" rid="s6_1">6.1</xref>.</p></fn> by which the training progress was monitored [<xref ref-type="bibr" rid="ref-38">38</xref>], as described in Section <xref ref-type="sec" rid="s6_1">6.1</xref>. The authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] explored several network architectures (with depth ranging from 1 to 5, keeping the width fixed at 50) to determine the optimal structure of their classifier network, which used the <xref ref-type="sec" rid="s5_3_1">logistic sigmoid</xref> as activation function.<xref ref-type="fn" rid="fn264"><sup>264</sup></xref><fn id="fn264"><label>264</label><p>The first author of [<xref ref-type="bibr" rid="ref-38">38</xref>] provided the information on the activation function through a private communication to the authors on 2018 Nov 16. 
Their tests with the <xref ref-type="sec" rid="s5_3_2">ReLU</xref> did not show improved performance (in terms of accuracy) as compared to the logistic sigmoid.</p></fn><sup>,</sup><xref ref-type="fn" rid="fn265"><sup>265</sup></xref><fn id="fn265"><label>265</label><p>Even though the squared-error loss function (Section <xref ref-type="sec" rid="s5_1_1">5.1.1</xref>) was used in [<xref ref-type="bibr" rid="ref-38">38</xref>], we also discuss the softmax loss function for classification tasks; see Section <xref ref-type="sec" rid="s5_1_3">5.1.3</xref>.</p></fn> Their optimal feed-forward network, composed of 3 hidden layers of 50 neurons each, correctly predicted the number of integration points needed for 98.6% of the elements in the training set, and for 81.6% of the elements in the validation set; see Figure <xref ref-type="fig" rid="fig-99">99</xref>.</p>
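<p>The stopping rule quoted above from [38], i.e., ending the training before the error on the validation set starts to increase, can be illustrated with a small helper. This is a sketch under assumptions: the function name, the <monospace>patience</monospace> parameter, and the error values are hypothetical and are not taken from [38], which does not specify how the increase was detected.</p>
<code language="python"><![CDATA[
```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the index of the epoch with the lowest validation error seen
    before the error failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error, stale = epoch, error, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation error keeps rising: stop training
    return best_epoch


# Hypothetical validation errors per epoch, for illustration only.
errors = [0.40, 0.31, 0.25, 0.22, 0.23, 0.24, 0.26, 0.21]
print(early_stopping_epoch(errors))
```
]]></code>
<p>With a patience of three epochs, the loop above stops after the third consecutive epoch without improvement and keeps the parameters of the best epoch found so far; a larger patience would tolerate longer plateaus at the cost of more training time.</p>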
<fig id="fig-98">
<label>Figure 98</label>
<caption><title><italic>Method 1, Optimal number of integration points, feasibility</italic> (Section 10.2.1). Distribution of the minimum number of integration points along a local coordinate axis for a maximum error of e<sup>tol</sup> = 10<sup>&#x2212;3</sup> among 10,000 elements generated randomly using the method in Figure 97. For <italic>d</italic> = 0.1, all elements were only slightly distorted, and required 3 integration points each. For <italic>d</italic> = 0.5, close to 5,000 elements required 4 integration points each; very few elements required 3 or 10 integration points [38]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-98.tif"/>
</fig>
</sec>
<sec id="s10_2_3"><label>10.2.3</label>
<title>Method 1, application phase</title>
<p>The correct number of quadrature points and the corresponding number of points predicted by the neural network are illustrated in Figure <xref ref-type="fig" rid="fig-100">100</xref> for both the training set (&#x201C;patterns&#x201D;) in Table (a) and the validation set (&#x201C;test patterns&#x201D;) in Table (b).</p>
<p>The training set and the validation set each had 5000 distorted element shapes. As an example of how to read these tables, take Table (a), Row 2 (red underline, labeled &#x201C;3&#x201D; in red circle) of the matrix (blue box): Out of a total of 1562 element shapes (last column, &#x201C;Total&#x201D;) in the training set that were ideally integrated using 3 quadrature points (in red circle), the neural network correctly estimated a need of 3 quadrature points (Column 2, labeled &#x201C;3&#x201D; in red circle) for 1553 element shapes, and 4 quadrature points (Column 3, labeled &#x201C;4&#x201D; in red circle) for 9 element shapes, i.e., an accuracy of 1553 / 1562 = 99.4% for Row 2. The accuracy varies, however, by row: 0% for Row 1 (2 integration points), 99.6% for Row 3, &#x2026;, 71.4% for Row 7, and 0% for Row 8 (9 integration points). The numbers in column &#x201C;Total&#x201D; add up to 5 + 1562 + 2557 + 636 + 162 + 55 + 21 + 2 = 5000 elements in the training set. The diagonal coefficients add up to 1553 + 2548 + 616 + 153 + 46 + 15 = 4931 elements with correctly predicted number of integration points, yielding the overall accuracy of 4931 / 5000 = 98.6% in training; see Figure <xref ref-type="fig" rid="fig-99">99</xref>.</p>
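<p>The per-row and overall accuracies quoted above can be reproduced with a few lines of code. The sketch below uses only the row totals and diagonal entries of Table (a) stated in the text; the assignment of the six nonzero diagonal entries to Rows 2 to 7 follows from the row accuracies quoted above, and the variable names are ours.</p>
<code language="python"><![CDATA[
```python
# Table (a) of Figure 100, training set: rows correspond to 2, 3, ..., 9
# integration points. Only the "Total" column and the diagonal are needed.
row_totals = [5, 1562, 2557, 636, 162, 55, 21, 2]
diagonal = [0, 1553, 2548, 616, 153, 46, 15, 0]

# Accuracy per row: correctly predicted elements divided by the row total.
row_accuracy = [100.0 * hit / total for hit, total in zip(diagonal, row_totals)]

# Overall accuracy: sum of the diagonal divided by the size of the set.
overall_accuracy = 100.0 * sum(diagonal) / sum(row_totals)

for points, accuracy in zip(range(2, 10), row_accuracy):
    print(f"{points} integration points: {accuracy:.1f}%")
print(f"overall: {overall_accuracy:.1f}%")
```
]]></code>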
<fig id="fig-99">
<label>Figure 99</label>
<caption><title><italic>Method 1, Optimal network architecture for training</italic> (Section <xref ref-type="sec" rid="s10_2_2">10.2.2</xref>). The number of hidden layers varies from 1 to 5, keeping the number of neurons per hidden layer constant at 50. The network with 3 hidden layers provided the highest accuracy for both the training set (&#x201C;patterns&#x201D;) at 98.6% and the validation set (&#x201C;test patterns&#x201D;) at 81.6%. Increasing the network depth does not necessarily increase the accuracy [<xref ref-type="bibr" rid="ref-38">38</xref>]. (Table reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-99.tif"/>
</fig>
<p>For Table (b) in Figure <xref ref-type="fig" rid="fig-100">100</xref>, the numbers in column &#x201C;Total&#x201D; add up to 5 + 1553 + 2574 + 656 + 135 + 56 + 21 + 5 = 5005 elements in the validation set, whereas the validation set should have 5000 elements according to [<xref ref-type="bibr" rid="ref-38">38</xref>]. The diagonal coefficients add up to 1430 + 2222 + 386 + 36 + 6 = 4080 elements with correctly predicted number of integration points, yielding the accuracy of 4080 / 5000 = 81.6%, which agrees with Row 3 in Figure <xref ref-type="fig" rid="fig-99">99</xref>. Given this agreement, the number of elements in the validation set (&#x201C;test patterns&#x201D;) should be 5000, and not 5005, which points to a misprint in column &#x201C;Total&#x201D;.</p>
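<p>The consistency check for Table (b) can be written out as a short computation. The sketch below uses only the numbers quoted in the text; it also shows that dividing by the printed total of 5005 would give 81.5% rather than the reported 81.6%, which supports reading 5000 as the intended size of the validation set. The variable names are ours.</p>
<code language="python"><![CDATA[
```python
# Table (b) of Figure 100, validation set, as quoted in the text.
row_totals_b = [5, 1553, 2574, 656, 135, 56, 21, 5]
diagonal_b = [1430, 2222, 386, 36, 6]  # nonzero diagonal entries
stated_size = 5000                     # size of the validation set stated in [38]

printed_size = sum(row_totals_b)
correct = sum(diagonal_b)

print(f"printed total: {printed_size}, surplus: {printed_size - stated_size}")
print(f"accuracy with stated size:  {100.0 * correct / stated_size:.1f}%")
print(f"accuracy with printed size: {100.0 * correct / printed_size:.1f}%")
```
]]></code>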
</sec></sec>
<sec id="s10_3"><label>10.3</label>
<title>Application 1.2: Method 2, optimal quadrature weights</title>
<p>The details of this particular deep-learning application, mentioned briefly in Section <xref ref-type="sec" rid="s2_3">2.3</xref> on motivation via applications of deep learning, Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>, item (2), are provided here. As an alternative to increasing the number of quadrature points, the authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] proposed to compensate for the quadrature error introduced by the element distortion by adjusting the quadrature weights at a fixed number of quadrature points. For this purpose, they introduced correction factors <inline-formula id="ieqn-2120"><mml:math id="mml-ieqn-2120"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> that were predicted by a neural network, and were multipliers for the corresponding standard quadrature weights <inline-formula id="ieqn-2121"><mml:math id="mml-ieqn-2121"><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> of an undistorted hexahedron. 
To exactly compute the components of the stiffness matrix, undistorted linear hexahedra require eight quadrature points (<inline-formula id="ieqn-2122"><mml:math id="mml-ieqn-2122"><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:math></inline-formula>) at local positions <inline-formula id="ieqn-2123"><mml:math id="mml-ieqn-2123"><mml:mi>&#x03BE;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B7;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B6;</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x00B1;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msqrt><mml:mn>3</mml:mn></mml:msqrt></mml:math></inline-formula> with uniform weights <inline-formula id="ieqn-2124"><mml:math id="mml-ieqn-2124"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>.</p>
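<p>To make the role of the correction factors concrete, the following sketch applies the 2 &#x00D7; 2 &#x00D7; 2 Gauss-Legendre rule with optional multipliers on the standard weights. It is an illustration under assumptions: the function name, the dictionary used to hold the correction factors, and the generic scalar integrand are ours; in [38] the integrand would be a coefficient of the element stiffness matrix, which is not reproduced here.</p>
<code language="python"><![CDATA[
```python
import itertools
import math

GAUSS_POINT = 1.0 / math.sqrt(3.0)  # local positions are +/- 1/sqrt(3)


def gauss_2x2x2(integrand, correction=None):
    """Approximate the integral of integrand(xi, eta, zeta) over the cube [-1, 1]^3.

    correction maps an index triple (i, j, k), each index being 1 or 2, to a
    correction factor w_ijk; None means that all factors are equal to 1, i.e.,
    the standard Gauss-Legendre rule.
    """
    total = 0.0
    for i, j, k in itertools.product((1, 2), repeat=3):
        xi, eta, zeta = ((-GAUSS_POINT, GAUSS_POINT)[n - 1] for n in (i, j, k))
        standard_weight = 1.0  # H_ijk of the undistorted hexahedron
        factor = 1.0 if correction is None else correction[(i, j, k)]
        total += factor * standard_weight * integrand(xi, eta, zeta)
    return total


# The standard rule integrates xi^2 * eta^2 exactly (8/9, up to round-off).
print(gauss_2x2x2(lambda xi, eta, zeta: xi**2 * eta**2))
```
]]></code>
<p>Multiplying all eight weights by the same factor simply scales the result, so any benefit of the correction has to come from choosing the factors individually for each element shape.</p>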
<p>The data preparation here was similar to that in Method 1, Section <xref ref-type="sec" rid="s10_2">10.2</xref>, except that 20,000 randomly distorted elements were generated, with 4000 elements in each of the five groups, each group having a different degree of maximum distortion <inline-formula id="ieqn-2125"><mml:math id="mml-ieqn-2125"><mml:mi>d</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>0.1</mml:mn><mml:mo>,</mml:mo><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mn>0.3</mml:mn><mml:mo>,</mml:mo><mml:mn>0.4</mml:mn><mml:mo>,</mml:mo><mml:mn>0.5</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, as depicted in Figure <xref ref-type="fig" rid="fig-97">97</xref> [<xref ref-type="bibr" rid="ref-38">38</xref>].</p>
<sec id="s10_3_1"><label>10.3.1</label>
<title>Method 2, feasibility study</title>
<p>Using the above 20,000 elements, the feasibility of improving integration accuracy by quadrature weight correction was established in Figure <xref ref-type="fig" rid="fig-101">101</xref>. To obtain these results, a brute-force search was used: For each of the 20000 elements, 1 million sets of random correction factors <inline-formula id="ieqn-2126"><mml:math id="mml-ieqn-2126" display="block"><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow> <mml:mo>}</mml:mo></mml:mrow><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mn>0.95</mml:mn><mml:mo>,</mml:mo><mml:mn>1.05</mml:mn></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> were tested to find the optimal set of correction factors for that single element [<xref ref-type="bibr" rid="ref-38">38</xref>]. The effectiveness of quadrature weight correction was quantified by the error reduction ratio defined by</p>
<p><disp-formula id="eqn-402"><label>(402)</label><mml:math id="mml-eqn-402" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo>,</mml:mo><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi 
mathvariant="normal">x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>i.e., the ratio between the quadrature error defined in Eq. (<xref ref-type="disp-formula" rid="eqn-401">401</xref>) obtained using the <italic>optimal</italic> (&#x201C;<italic>opt</italic>&#x201D;) corrected quadrature weights and the quadrature error obtained using the standard quadrature weights of Gauss-Legendre quadrature. Accordingly, a ratio <inline-formula id="ieqn-2127"><mml:math id="mml-ieqn-2127"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> indicates that the modified quadrature weights improved the accuracy. For each element, the optimal correction factors that yielded the smallest error ratio, i.e.,</p>
<p><disp-formula id="eqn-403"><label>(403)</label><mml:math id="mml-eqn-403" display="block"><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow> <mml:mo>}</mml:mo></mml:mrow><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mi>arg</mml:mi><mml:mtext>&#x2009;</mml:mtext><mml:munder><mml:mrow><mml:mi>min</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<fig id="fig-100">
<label>Figure 100</label>
<caption><title><italic>Method 1, application phase</italic> (Section <xref ref-type="sec" rid="s10_2_3">10.2.3</xref>). The numbers of quadrature points predicted by the neural network were compared to the minimum numbers of quadrature points for maximum error <inline-formula id="ieqn-1947"><mml:math id="mml-ieqn-1947"><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-38">38</xref>]. Table (a) shows the results for the training set (&#x201C;patterns&#x201D;), and Table (b) for the validation set (&#x201C;test patterns&#x201D;). (Table reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-100.tif"/>
</fig>
<p>were retained as target values for training, and identified with the superscript &#x201C;<italic>opt</italic>&#x201D;, standing for &#x201C;optimal&#x201D;. The corresponding optimally integrated coefficients in the element stiffness matrix are denoted by <inline-formula id="ieqn-2128"><mml:math id="mml-ieqn-2128"><mml:msubsup><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo>,</mml:mo><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, with <inline-formula id="ieqn-2129"><mml:math id="mml-ieqn-2129"><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo></mml:math></inline-formula>, as they appear in Eq. (<xref ref-type="disp-formula" rid="eqn-402">402</xref>).</p>
<p>It turns out that a reduction of the quadrature error by correcting the quadrature weights is not feasible for all element shapes. Undistorted elements, for instance, which are already integrated exactly using standard quadrature weights, naturally do not admit improvements. These 20,000 elements were classified into two categories A and B [<xref ref-type="bibr" rid="ref-38">38</xref>], Figure <xref ref-type="fig" rid="fig-101">101</xref>. Quadrature weight correction was not effective for Category A (<inline-formula id="ieqn-2130"><mml:math id="mml-ieqn-2130"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2265;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>), but effective for Category B (<inline-formula id="ieqn-2131"><mml:math id="mml-ieqn-2131"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>). Figure <xref ref-type="fig" rid="fig-101">101</xref> shows that elements with a higher degree of maximum distortion were more likely to benefit from the quadrature weight correction as compared to weakly distorted elements. Recall that among the elements of the group <inline-formula id="ieqn-2132"><mml:math id="mml-ieqn-2132"><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula> were elements that were only slightly distorted (due to the random factors <inline-formula id="ieqn-2133"><mml:math id="mml-ieqn-2133"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in Eq. 
(<xref ref-type="disp-formula" rid="eqn-400">400</xref>) being close to zero), and therefore would not benefit from quadrature weight correction (Category A); there were 1489 such elements, Figure <xref ref-type="fig" rid="fig-101">101</xref>.</p>
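<p>The brute-force search and the error reduction ratio of Eq. (402) can be sketched as follows. This is a toy illustration under stated assumptions: the two integrands with known exact integrals are hypothetical stand-ins for stiffness coefficients and are not taken from [38]; the exact integrals play the role of the reference values obtained with the maximum number of quadrature points; and the number of trials is far smaller than the 1 million sets per element used in [38].</p>
<code language="python"><![CDATA[
```python
import itertools
import math
import random

G = 1.0 / math.sqrt(3.0)
# The eight Gauss-Legendre points of the 2 x 2 x 2 rule (standard weights are 1).
POINTS = [tuple((-G, G)[n] for n in idx) for idx in itertools.product((0, 1), repeat=3)]


def error_ratio(k_corrected, k_standard, k_reference):
    """Summed absolute error with corrected weights divided by the summed
    absolute error with standard weights, cf. Eq. (402)."""
    numerator = sum(abs(c - r) for c, r in zip(k_corrected, k_reference))
    denominator = sum(abs(s - r) for s, r in zip(k_standard, k_reference))
    return numerator / denominator


def random_search(integrands, k_reference, n_trials, rng):
    """Test n_trials random sets of 8 correction factors in [0.95, 1.05].

    Returns (best ratio, best factors); (1.0, all ones) means no improvement.
    """
    samples = [[f(*p) for p in POINTS] for f in integrands]
    k_standard = [sum(values) for values in samples]
    best_ratio, best_factors = 1.0, [1.0] * len(POINTS)
    for _ in range(n_trials):
        factors = [rng.uniform(0.95, 1.05) for _ in POINTS]
        k_corrected = [sum(w * v for w, v in zip(factors, values)) for values in samples]
        ratio = error_ratio(k_corrected, k_standard, k_reference)
        if ratio < best_ratio:
            best_ratio, best_factors = ratio, factors
    return best_ratio, best_factors


# Hypothetical integrands with exact integrals over the cube [-1, 1]^3.
integrands = [
    lambda xi, eta, zeta: 1.0 / (2.0 + xi),
    lambda xi, eta, zeta: math.exp(xi + eta + zeta),
]
k_reference = [4.0 * math.log(3.0), (math.e - 1.0 / math.e) ** 3]

best_ratio, best_factors = random_search(integrands, k_reference, 1000, random.Random(0))
category = "B" if best_ratio < 1.0 else "A"  # B: weight correction helps
print(category)
```
]]></code>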
</sec>
<sec id="s10_3_2"><label>10.3.2</label>
<title>Method 2, training phase</title>
<p>Because the effectiveness of the quadrature weight correction strongly depends on the degree of maximum distortion, <inline-formula id="ieqn-2134"><mml:math id="mml-ieqn-2134"><mml:mi>d</mml:mi></mml:math></inline-formula>, the authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] proposed a two-stage approach for the correction of quadrature weights, which relied on two fully-connected feedforward neural networks.</p>
<p>In the first stage, a <italic>first</italic> neural network, a binary classifier, was trained to predict whether an element shape admits improved accuracy by quadrature weight correction (Category B) or not (Category A). The neural network performing the classification task took the 18 non-trivial nodal coordinates obtained upon the proposed normalization procedure for linear hexahedra as inputs, i.e., <inline-formula id="ieqn-2135"><mml:math id="mml-ieqn-2135"><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>18</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. The output of the neural network was a single scalar <inline-formula id="ieqn-2136"><mml:math id="mml-ieqn-2136"><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:math></inline-formula>, where <inline-formula id="ieqn-2137"><mml:math id="mml-ieqn-2137"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> indicated an element of Category A and <inline-formula id="ieqn-2138"><mml:math id="mml-ieqn-2138"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> an element of Category B.</p>
<fig id="fig-101">
<label>Figure 101</label>
<caption><title><italic>Method 2, Quadrature weight correction, feasibility</italic> (Section <xref ref-type="sec" rid="s10_3_1">10.3.1</xref>). Each element was tested 1 million times with randomly generated sets of quadrature weights. There were 4000 elements in each of the 5 groups with different degrees of maximum distortion, <inline-formula id="ieqn-1948"><mml:math id="mml-ieqn-1948"><mml:mi>d</mml:mi></mml:math></inline-formula>. Quadrature weight correction effectiveness increased with element distortion. Weakly distorted elements (<inline-formula id="ieqn-1949"><mml:math id="mml-ieqn-1949"><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula>) did not have any improvement, and thus belonged to Category A (error ratio <inline-formula id="ieqn-1950"><mml:math id="mml-ieqn-1950"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2265;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>). As <inline-formula id="ieqn-1951"><mml:math id="mml-ieqn-1951"><mml:mi>d</mml:mi></mml:math></inline-formula> increased, the size of Category A decreased, while the size of Category B increased. 
Among the 4000 elements in the group with <inline-formula id="ieqn-1952"><mml:math id="mml-ieqn-1952"><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>, the stiffness matrices of 2511 elements could be integrated more accurately by correcting their quadrature weights (Category <inline-formula id="ieqn-1953"><mml:math id="mml-ieqn-1953"><mml:mi>B</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-1954"><mml:math id="mml-ieqn-1954"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>) [<xref ref-type="bibr" rid="ref-38">38</xref>]. (Table reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-101.tif"/>
</fig>
<p>Out of the 20000 elements generated, 10000 elements were selected to train the classifier network, for which both the training set and the validation set comprised 5000 elements each [<xref ref-type="bibr" rid="ref-38">38</xref>].<xref ref-type="fn" rid="fn266"><sup>266</sup></xref><fn id="fn266"><label>266</label><p>Even though a reason for not using the entire set of 20000 elements was not given, it could be guessed that the authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] would want the size of the training set and of the validation set to be the same as that in Method 1, Section <xref ref-type="sec" rid="s10_2">10.2</xref>. Moreover, even though details were not given, the selection of these 10,000 elements would likely be a random process.</p></fn> The optimal neural network in terms of classification accuracy for this application had 4 hidden layers with 30 neurons per layer; see Figure <xref ref-type="fig" rid="fig-102">102</xref>. The trained neural network succeeded in predicting the correct category for 98% and 92% of the elements in the training set and in the validation set, respectively.</p>
<p>In the second stage, a <italic>second</italic> neural network was trained to predict the corrections to the quadrature weights for all those elements that allowed a reduction of the quadrature error. Again, the 18 non-trivial nodal coordinates of a normalized hexahedron were input to the neural network, i.e., <inline-formula id="ieqn-2139"><mml:math id="mml-ieqn-2139"><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>18</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. The outputs of the neural network <inline-formula id="ieqn-2140"><mml:math id="mml-ieqn-2140"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>8</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> represented the eight correction factors to the standard weights <inline-formula id="ieqn-2141"><mml:math id="mml-ieqn-2141"><mml:mrow><mml:mrow><mml:mo>{</mml:mo></mml:mrow><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-2142"><mml:math id="mml-ieqn-2142"><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>, of the Gauss-Legendre quadrature. 
The authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] stated that 10000 elements with an error reduction ratio <inline-formula id="ieqn-2143"><mml:math id="mml-ieqn-2143"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> formed equally large training and validation sets, comprising 5000 elements each.<xref ref-type="fn" rid="fn267"><sup>267</sup></xref><fn id="fn267"><label>267</label><p>According to Figure <xref ref-type="fig" rid="fig-101">101</xref>, only 4868 out of the total of 20000 elements generated belonged to Category B, for which <inline-formula id="ieqn-3313"><mml:math id="mml-ieqn-3313"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&lt;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> held. Further details on the 10000 elements that were used for training and validation were not provided in [<xref ref-type="bibr" rid="ref-38">38</xref>].</p></fn> A neural network with 5 hidden layers of 50 neurons was reported to perform best in predicting the corrections to the quadrature weights. The normalized error<xref ref-type="fn" rid="fn268"><sup>268</sup></xref><fn id="fn268"><label>268</label><p>No details on the normalization procedure were provided in the paper.</p></fn> distribution of the correction factors is illustrated in Figure <xref ref-type="fig" rid="fig-103">103</xref>.</p>
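<p>The two-stage approach described in this section can be summarized by the following sketch of how the two trained networks would be combined for a single element. It is an illustration under assumptions: the function and argument names are ours, the two networks are represented by placeholder callables, and the decision threshold of 0.5 on the scalar classifier output is an assumption that is not stated in [38].</p>
<code language="python"><![CDATA[
```python
def corrected_weights(nodal_coords, classifier, regressor, threshold=0.5):
    """Return the 8 quadrature weights for one normalized hexahedron.

    classifier(nodal_coords) -> scalar (near 0: Category A, near 1: Category B)
    regressor(nodal_coords)  -> 8 correction factors
    Both callables are placeholders for trained networks.
    """
    if len(nodal_coords) != 18:
        raise ValueError("expected the 18 non-trivial nodal coordinates")
    standard_weights = [1.0] * 8  # 2 x 2 x 2 Gauss-Legendre rule
    if classifier(nodal_coords) < threshold:
        # Stage 1 predicts Category A: keep the standard weights.
        return standard_weights
    # Stage 2: scale each standard weight by its predicted correction factor.
    factors = regressor(nodal_coords)
    return [w * h for w, h in zip(factors, standard_weights)]


# Stub "networks" returning fixed values, for illustration only.
coords = [0.0] * 18
print(corrected_weights(coords, lambda x: 0.9, lambda x: [1.02] * 8))
```
]]></code>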
</sec>
<sec id="s10_3_3"><label>10.3.3</label>
<title>Method 2, application phase</title>
<p>The effectiveness of the numerical quadrature with corrected quadrature weights was already presented in Figure <xref ref-type="fig" rid="fig-10">10</xref> in Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref>, with the distribution of the error-reduction ratio <inline-formula id="ieqn-2144"><mml:math id="mml-ieqn-2144"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> defined in Eq. (<xref ref-type="disp-formula" rid="eqn-402">402</xref>) for the training set (patterns) (a) and the validation set (test patterns) (b). Further explanation is provided here for Figure <xref ref-type="fig" rid="fig-10">10</xref>.<xref ref-type="fn" rid="fn269"><sup>269</sup></xref><fn id="fn269"><label>269</label><p>The caption of Figure <xref ref-type="fig" rid="fig-10">10</xref>, if it were in Section <xref ref-type="sec" rid="s10_3_3">10.3.3</xref>, would begin with <italic>&#x201C;Method 2, application phase&#x201D;</italic> in parallel to the caption of Figure <xref ref-type="fig" rid="fig-100">100</xref>.</p></fn></p>
<fig id="fig-102">
<label>Figure 102</label>
<caption><title><italic>Method 2, training phase, classifier network</italic> (Section <xref ref-type="sec" rid="s10_3_2">10.3.2</xref>). The training and validation sets comprised 5000 elements each, of which 3707 and 3682, respectively, belonged to Category A (no improvements upon weight correction). A <italic>first</italic> neural network with 4 hidden layers of 30 neurons correctly classified <inline-formula id="ieqn-1955"><mml:math id="mml-ieqn-1955"><mml:mo stretchy="false">(</mml:mo><mml:mn>3707</mml:mn><mml:mo>+</mml:mo><mml:mn>1194</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>5000</mml:mn><mml:mo>&#x2248;</mml:mo><mml:mrow><mml:mn>98</mml:mn></mml:mrow><mml:mrow><mml:mtext>&#x0025;</mml:mtext></mml:mrow></mml:math></inline-formula> elements in the training set (a) and <inline-formula id="ieqn-1956"><mml:math id="mml-ieqn-1956"><mml:mo stretchy="false">(</mml:mo><mml:mn>3682</mml:mn><mml:mo>+</mml:mo><mml:mn>939</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>5000</mml:mn><mml:mo>&#x2248;</mml:mo><mml:mrow><mml:mn>92</mml:mn></mml:mrow><mml:mrow><mml:mtext>&#x0025;</mml:mtext></mml:mrow></mml:math></inline-formula> elements in the validation set (b). See also Figure <xref ref-type="fig" rid="fig-10">10</xref> in Section <xref ref-type="sec" rid="s2_3_1">2.3.1</xref> [<xref ref-type="bibr" rid="ref-38">38</xref>]. (Table reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-102.tif"/>
</fig>
<p>The red bars (&#x201C;Optimized&#x201D;) in Figure <xref ref-type="fig" rid="fig-10">10</xref> represent the distribution of the error-reduction ratio <inline-formula id="ieqn-2145"><mml:math id="mml-ieqn-2145"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-402">402</xref>), using the optimal correction factors, which were themselves obtained by a brute force method (Sections <xref ref-type="sec" rid="s10_3_1">10.3.1</xref>-<xref ref-type="sec" rid="s10_3_2">10.3.2</xref>). While these optimal correction factors were used as targets for training (Figure <xref ref-type="fig" rid="fig-10">10</xref> (a)), they were used only to compute the error-reduction ratios <inline-formula id="ieqn-2146"><mml:math id="mml-ieqn-2146"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> for elements in the validation set (Figure <xref ref-type="fig" rid="fig-10">10</xref> (b)).</p>
<p>The blue bars (&#x201C;Estimated by Neuro&#x201D;) correspond to the error-reduction ratios achieved with the corrected quadrature weights that were predicted by the trained neural network.</p>
<p>The error-reduction ratios <inline-formula id="ieqn-2147"><mml:math id="mml-ieqn-2147"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> indicate improved quadrature accuracy, which occurred for 99 % of the elements in the training set, and for 97 % of the elements in the validation set. Very few elements had their accuracy worsened (<inline-formula id="ieqn-2148"><mml:math id="mml-ieqn-2148"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003E;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>) when using the predicted quadrature weights.</p>
<p>There were no red bars (optimal weights) with <inline-formula id="ieqn-2149"><mml:math id="mml-ieqn-2149"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003E;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> since only elements that admitted improved quadrature accuracy by optimal quadrature weight correction were included in the training set and validation set. The blue bars with error ratios <inline-formula id="ieqn-2150"><mml:math id="mml-ieqn-2150"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003E;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> corresponded to the distorted hexahedra for which the accuracy worsened as compared to the standard weights of Gauss-Legendre quadrature.</p>
<p>The authors of [<xref ref-type="bibr" rid="ref-38">38</xref>] concluded their paper with a discussion on the computational effort, in particular that of evaluating trained neural networks in the application phase. As opposed to computational mechanics, where we are used to double-precision floating-point arithmetic, deep neural networks have proven to perform well with reduced numerical precision.<xref ref-type="fn" rid="fn270"><sup>270</sup></xref><fn id="fn270"><label>270</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-294">294</xref>] [<xref ref-type="bibr" rid="ref-295">295</xref>] [<xref ref-type="bibr" rid="ref-296">296</xref>].</p></fn> To speed up the evaluation of the trained networks, the least significant bits from all parameters (weights, biases) and the inputs were simply removed [<xref ref-type="bibr" rid="ref-38">38</xref>]. In both the estimation of the number of quadrature points and the prediction of the weight correction factors, half-precision floating-point numbers (16 bit) turned out to provide sufficient accuracy, almost on par with single-precision floats (32 bit).</p>
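<p>The effect of reduced precision on network evaluation can be explored with a small NumPy experiment such as the one sketched below. It is only an illustration under stated assumptions: the layer sizes (18 inputs, 5 hidden layers of 50 neurons, 8 outputs) follow the regression network of Section <xref ref-type="sec" rid="s10_3_2">10.3.2</xref>, but the weights are random rather than trained, the tanh activation is an assumption, and casting to half precision rounds the parameters instead of truncating the least significant bits as described in [<xref ref-type="bibr" rid="ref-38">38</xref>]. The function names <monospace>init_params</monospace> and <monospace>forward</monospace> are hypothetical.</p>
<preformat>
```python
import numpy as np


def init_params(sizes, seed=0):
    """Create random (untrained) weights and zero biases for a fully connected network."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
        b = np.zeros(n_out)
        params.append((W, b))
    return params


def forward(params, x, dtype):
    """Evaluate the network with all parameters and inputs cast to `dtype`."""
    y = x.astype(dtype)
    for k, (W, b) in enumerate(params):
        z = W.astype(dtype) @ y + b.astype(dtype)
        is_last = (k == len(params) - 1)
        y = z if is_last else np.tanh(z)  # linear output layer (assumption)
    return y


sizes = [18, 50, 50, 50, 50, 50, 8]
params = init_params(sizes)
x = np.random.default_rng(1).uniform(-1.0, 1.0, size=18)

y32 = forward(params, x, np.float32)  # single precision (32 bit)
y16 = forward(params, x, np.float16)  # half precision (16 bit)
max_diff = float(np.max(np.abs(y32 - y16.astype(np.float32))))
print("largest output difference between 32-bit and 16-bit evaluation:", max_diff)
```
</preformat>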
<fig id="fig-103">
<label>Figure 103</label>
<caption><title><italic>Method 2, training phase, regression network</italic> (Section <xref ref-type="sec" rid="s10_3_2">10.3.2</xref>). A <italic>second</italic> neural network estimated 8 correction factors <inline-formula id="ieqn-1957"><mml:math id="mml-ieqn-1957"><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, with <inline-formula id="ieqn-1958"><mml:math id="mml-ieqn-1958"><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, to be multiplied by the standard quadrature weights for each element. Distribution of normalized errors, i.e., the normalized differences between the predicted weights (outputs) <inline-formula id="ieqn-1959"><mml:math id="mml-ieqn-1959"><mml:msub><mml:mi>O</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> and the true weights <inline-formula id="ieqn-1960"><mml:math id="mml-ieqn-1960"><mml:msub><mml:mi>T</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> for the elements of the training set (red) and the test set (blue). For both sets, which consist of <inline-formula id="ieqn-1961"><mml:math id="mml-ieqn-1961"><mml:mn>5000</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>8</mml:mn><mml:mo>=</mml:mo><mml:mrow><mml:mn>40,000</mml:mn></mml:mrow></mml:math></inline-formula> correction factors each, the error has a mean of zero and seems to obey a normal distribution [<xref ref-type="bibr" rid="ref-38">38</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-103.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec id="s11"><label>11</label>
<title>Application 2: Solid mechanics, multi-scale, multi-physics</title>
<p>The results and deep-learning concepts used in [<xref ref-type="bibr" rid="ref-25">25</xref>] were presented in Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref> further above. In this section, we discuss some details of the formulation.</p>
<sec id="s11_1"><label>11.1</label>
<title>Multiscale problems</title>
<p>Multiscale problems are characterized by the fact that couplings of physical processes occurring on different scales of length and/or time need to be considered. In the field of computational mechanics, multiscale models are often used to accurately capture the constitutive behavior on a macroscopic length scale, since resolving the entire domain under consideration on the smallest relevant scale is often intractable. To reduce the computational costs, multiscale techniques such as coupled DEM-FEM or coupled FEM-FEM (known as FEM<inline-formula id="ieqn-2151"><mml:math id="mml-ieqn-2151"><mml:msup><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>),<xref ref-type="fn" rid="fn271"><sup>271</sup></xref><fn id="fn271"><label>271</label><p>DEM = Discrete Element Method. FEM = Finite Element Method. See references in [<xref ref-type="bibr" rid="ref-25">25</xref>].</p></fn> have been proposed for bridging length scales by deducing constitutive laws from micro-scale models for Representative Volume Elements (RVEs).<xref ref-type="fn" rid="fn272"><sup>272</sup></xref><fn id="fn272"><label>272</label><p>RVEs are also referred to as <italic>representative elementary volumes</italic> (REVs) or, simply, <italic>unit cells</italic>.</p></fn> These approaches no longer require the entire macroscopic domain to be resolved on the micro-scale. In FEM<inline-formula id="ieqn-2152"><mml:math id="mml-ieqn-2152"><mml:msup><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, for instance, the microscale model used to deduce the constitutive behavior on the macroscopic scale is evaluated at the quadrature points of the macroscale model.</p>
<p>The multiscale problem in the mechanics of porous media tackled in [<xref ref-type="bibr" rid="ref-25">25</xref>] is represented in Figure <xref ref-type="fig" rid="fig-104">104</xref>, where the relative orientations among the three models at microscale, mesoscale, and macroscale in Figure <xref ref-type="fig" rid="fig-14">14</xref> are indicated. The method of analysis (DEM or FEM) in each scale is also indicated in the figure.</p>
</sec>
<sec id="s11_2"><label>11.2</label>
<title>Data-driven constitutive modeling, deep learning</title>
<p>Despite the diverse approaches proposed, multiscale problems remain a computationally challenging task, which was tackled in [<xref ref-type="bibr" rid="ref-25">25</xref>] by means of a hybrid data-driven method that combined deep neural networks and conventional constitutive models. To illustrate the hierarchy among the relations of models, the authors of [<xref ref-type="bibr" rid="ref-25">25</xref>] used directed graphs, which also indicated the nature of the individual relations by the colors of the graph edges. Black edges correspond to &#x201C;universal principles&#x201D; whereas red edges represent phenomenological relations; see, e.g., the classical problem in solid mechanics shown in Figure <xref ref-type="fig" rid="fig-105">105</xref>. Within classical mechanics, the balance of linear momentum is axiomatic in nature, i.e., it represents a well-accepted premise that is taken to be true. The relation between the displacement field and the strain tensor represents a definition. The constitutive law describing the stress response, which, in the elastic case, is an algebraic relation among stresses and strains, is the only phenomenological part in the &#x201C;single-physics&#x201D; solid mechanics problem and is, therefore, highlighted in red.</p>
<fig id="fig-104">
<label>Figure 104</label>
<caption><title><italic>Three scales in data-driven fault-reactivation simulations</italic> (Sections <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, <xref ref-type="sec" rid="s11_1">11.1</xref>, <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>). Relative orientation of Representative Volume Elements (RVEs). <italic>Left:</italic> Microscale (<inline-formula id="ieqn-1962"><mml:math id="mml-ieqn-1962"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula>) RVE using Discrete Element Method (DEM), Figure <xref ref-type="fig" rid="fig-106">106</xref> and Row 1 of Figure <xref ref-type="fig" rid="fig-14">14</xref>. <italic>Center:</italic> Mesoscale (cm) REV using FEM; Row 2 of Figure <xref ref-type="fig" rid="fig-14">14</xref>. <italic>Right:</italic> Field-size macroscale (km) FEM model; Row 3 of Figure <xref ref-type="fig" rid="fig-14">14</xref> [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-104.tif"/>
</fig>
<fig id="fig-105">
<label>Figure 105</label>
<caption><title><italic>Single-physics block diagram</italic> (Section <xref ref-type="sec" rid="s11_2">11.2</xref>). Single physics is the easiest way to see the role of deep learning in modeling complex nonlinear constitutive behavior (stress-strain relation, red arrow), as first realized in [<xref ref-type="bibr" rid="ref-23">23</xref>], where balance of linear momentum and strain-displacement relation are definitions or accepted &#x201C;universal principles&#x201D; (black arrows) [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-105.tif"/>
</fig>
<p>In many engineering problems, stress-strain relations of (possibly nonlinear) elasticity, which are parameterized by a set of elastic moduli, can be used in the modeling. For heterogeneous materials such as composite structures, even the &#x201C;single physics&#x201D; problem of elastic solid mechanics may necessitate multiscale approaches, in which constitutive laws are replaced by RVE simulations and homogenization. This approach was extended to multiphysics models of porous media, in which multiple scales needed to be considered [<xref ref-type="bibr" rid="ref-25">25</xref>]. The counterpart of Figure <xref ref-type="fig" rid="fig-105">105</xref> for the mechanics of porous media is complex, could be confusing for readers not familiar with the field, does not add much to the understanding of the use of deep learning in this study, and is therefore not included here; see [<xref ref-type="bibr" rid="ref-25">25</xref>].</p>
<p>The hybrid approach in [<xref ref-type="bibr" rid="ref-25">25</xref>], which was described as a graph-based machine learning model, retained those parts of the model which represented universal principles or definitions (black arrows). Phenomenological relations (red arrows), which, in conventional multiscale approaches, followed from microscale models, were replaced by computationally efficient data-driven models. In view of the path-dependency of the constitutive behavior in the poromechanics problem considered, it was proposed in [<xref ref-type="bibr" rid="ref-25">25</xref>] to use <xref ref-type="sec" rid="s7_1">recurrent neural networks</xref> (RNNs), Section <xref ref-type="sec" rid="s7_1">7.1</xref>, constructed with <xref ref-type="sec" rid="s7_2">Long Short-Term Memory</xref> (LSTM) cells, Section <xref ref-type="sec" rid="s7_2">7.2</xref>.</p>
</sec>
<sec id="s11_3"><label>11.3</label>
<title>Multiscale multiphysics problem: Porous media</title>
<p>The problem of hydro-mechanical coupling in deformable porous media with multiple permeabilities is characterized by the presence of two or more pore systems with different typical sizes or geometrical features of the host matrix [<xref ref-type="bibr" rid="ref-25">25</xref>]. The individual pore systems may exchange fluid depending on whether the pores are connected or not. If the (macroscopic) deformation of the solid skeleton is large, plastic deformation and cracks may occur, which result in anisotropic evolution of the effective permeability. As a consequence, problems of this kind are not characterized by a single effective permeability, and, to identify the material parameters on the macroscopic scale, micro-structural models need to be incorporated.</p>
<p>The authors of [<xref ref-type="bibr" rid="ref-25">25</xref>] considered a saturated porous medium, which features two dominant pore scales: The regular solid matrix was characterized by micropores, whereas macropores may result, e.g., from cracks and fissures. Both the volume and the partial densities of each constituent in the mixture of solid, micropores, macropores and voids were characterized by the porosity, i.e., the (local) ratio of the pore volume to the total volume, as well as the fractions of the respective pore systems.</p>
<sec id="s11_3_1"><label>11.3.1</label>
<title>Recurrent neural networks for scale bridging</title>
<p>Recurrent neural networks (RNNs, Section <xref ref-type="sec" rid="s7_1">7.1</xref>), which are equivalent to &#x201C;very deep feedforward networks&#x201D; (Remark <xref ref-type="statement" rid="st7_2">7.2</xref>), were used in [<xref ref-type="bibr" rid="ref-25">25</xref>] as a scale-bridging method to efficiently simulate multiscale problems of poroplasticity. In Figure <xref ref-type="fig" rid="fig-14">14</xref>, Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, three scales were considered: Microscale (<inline-formula id="ieqn-2153"><mml:math id="mml-ieqn-2153"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula>), mesoscale (cm), macroscale (km).</p>
<p>At the microscale, the Discrete Element Method (DEM) is used to simulate a mesoscale Representative Volume Element (RVE), which consists of a cubic pack of microscale particles, under different loading conditions. These simulations generate a training set for a mesoscale RNN with LSTM architecture, which models the mesoscale constitutive response and produces loading histories <inline-formula id="ieqn-2154"><mml:math id="mml-ieqn-2154"><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:mo>&#x0027;</mml:mo></mml:msubsup><mml:mo>)</mml:mo></mml:math></inline-formula> on demand.</p>
<p>At the mesoscale, the Finite Element Method (FEM), combined with the mesoscale loading histories <inline-formula id="ieqn-2155"><mml:math id="mml-ieqn-2155"><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2156"><mml:math id="mml-ieqn-2156"><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:mo>&#x0027;</mml:mo></mml:msubsup><mml:mo>)</mml:mo></mml:math></inline-formula> produced by the above mesoscale RNN, is used to model a macroscale RVE subjected to different loading conditions. These simulations generate a training set for a macroscale RNN with LSTM architecture, which models the macroscale constitutive response and produces loading histories <inline-formula id="ieqn-2157"><mml:math id="mml-ieqn-2157"><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:mo>&#x0027;</mml:mo></mml:msubsup><mml:mo>)</mml:mo></mml:math></inline-formula> on demand.</p>
<p>At the macroscale, the Finite Element Method (FEM), combined with the macroscale loading histories <inline-formula id="ieqn-2158"><mml:math id="mml-ieqn-2158"><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:math></inline-formula> <inline-formula id="ieqn-2159"><mml:math id="mml-ieqn-2159"><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:mo>&#x0027;</mml:mo></mml:msubsup><mml:mo>)</mml:mo></mml:math></inline-formula> produced by the above macroscale RNN, is used to model an actual domain at kilometer size (macroscale).</p>
</sec>
<sec id="s11_3_2"><label>11.3.2</label>
<title>Microstructure and principal direction data</title>
<p>To train the mesoscale RNN with LSTM units (which was called the &#x201C;Mesoscale data-driven constitutive model&#x201D; in [<xref ref-type="bibr" rid="ref-25">25</xref>]), incorporating microstructure data&#x2013;such as the fabric tensor <inline-formula id="ieqn-2160"><mml:math id="mml-ieqn-2160"><mml:mrow><mml:mi mathvariant="bold-italic">F</mml:mi></mml:mrow></mml:math></inline-formula> of the 1st kind of rank 2 (motivated in Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, Figure <xref ref-type="fig" rid="fig-16">16</xref> and Figure <xref ref-type="fig" rid="fig-17">17</xref>) defined in Eq. (<xref ref-type="disp-formula" rid="eqn-404">404</xref>) [<xref ref-type="bibr" rid="ref-102">102</xref>]&#x2013;into the training set, whose data were generated by a discrete element assembly as in Figure <xref ref-type="fig" rid="fig-106">106</xref> depicting a microscale RVE, was important to obtain good network predictions, as mentioned in Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref> in relation to Figure <xref ref-type="fig" rid="fig-17">17</xref>:</p>
<p><disp-formula id="eqn-404"><label>(404)</label><mml:math id="mml-eqn-404" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mo>:=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi>F</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi>F</mml:mi></mml:msub><mml:mo>:=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="bold-italic">n</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">n</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-106">
<label>Figure 106</label>
<caption><title><italic>Microscale RVE</italic> (Sections <xref ref-type="sec" rid="s11_3_2">11.3.2</xref>, <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>, <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>). A <inline-formula id="ieqn-1963"><mml:math id="mml-ieqn-1963"><mml:mn>10</mml:mn><mml:mrow><mml:mtext>&#xA0;cm</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>10</mml:mn><mml:mrow><mml:mtext>&#xA0;cm</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>5</mml:mn><mml:mrow><mml:mtext>&#xA0;cm</mml:mtext></mml:mrow></mml:math></inline-formula> box of identical spheres of <inline-formula id="ieqn-1964"><mml:math id="mml-ieqn-1964"><mml:mn>0.5</mml:mn><mml:mrow><mml:mtext>&#xA0;cm</mml:mtext></mml:mrow></mml:math></inline-formula> diameter (Figure <xref ref-type="fig" rid="fig-14">14</xref>, row 1, Remark <xref ref-type="statement" rid="st11_1">11.1</xref>), subjected to imposed displacements. (a) Initial configuration of the granular assembly; (b) Imposed displacement, deformed configuration; (c) Flow network generated from deformed configuration used to predict anisotropic effective permeability [<xref ref-type="bibr" rid="ref-25">25</xref>]. See also Figure <xref ref-type="fig" rid="fig-104">104</xref> and Figure <xref ref-type="fig" rid="fig-110">110</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-106.tif"/>
</fig>
<p>where <inline-formula id="ieqn-2161"><mml:math id="mml-ieqn-2161"><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> is the number of contact points (which is the same as the coordination number <inline-formula id="ieqn-2162"><mml:math id="mml-ieqn-2162"><mml:mi>C</mml:mi><mml:mi>N</mml:mi></mml:math></inline-formula>), and <inline-formula id="ieqn-2163"><mml:math id="mml-ieqn-2163"><mml:msub><mml:mi>n</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> the unit normal vector at contact point <inline-formula id="ieqn-2164"><mml:math id="mml-ieqn-2164"><mml:mi>c</mml:mi></mml:math></inline-formula>.</p>
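<p>The definition in Eq. (<xref ref-type="disp-formula" rid="eqn-404">404</xref>) translates directly into a few lines of NumPy, sketched below as an illustration rather than as the code used in [<xref ref-type="bibr" rid="ref-25">25</xref>]. The function name <monospace>fabric_tensor</monospace> and the sample contact normals are hypothetical; the input normals are normalized defensively, which is an added assumption.</p>
<preformat>
```python
import numpy as np


def fabric_tensor(normals):
    """Average of the outer products n_c (x) n_c over all contact points c."""
    n = np.asarray(normals, dtype=float)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)  # enforce unit normals
    # n.T @ n sums the outer products of the rows; dividing by N_c averages them.
    return n.T @ n / n.shape[0]


# Hypothetical contact normals along the three coordinate axes.
normals = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
F = fabric_tensor(normals)
```
</preformat>
<p>For these three normals, the sketch returns one third of the identity, i.e., an isotropic fabric. The resulting tensor is symmetric with unit trace, so, unlike the scalar porosity or coordination number, it carries directional information about the contact network.</p>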
<statement id="st11_1"><title>Remark 11.1.</title>
<p>Even though in Figure <xref ref-type="fig" rid="fig-14">14</xref> (row 1) in Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, the microscale RVE was indicated to be of micron size, the microscale RVE in Figure <xref ref-type="fig" rid="fig-106">106</xref> was of size <inline-formula id="ieqn-2165"><mml:math id="mml-ieqn-2165"><mml:mn>10</mml:mn><mml:mrow><mml:mtext>&#xA0;cm</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>10</mml:mn><mml:mrow><mml:mtext>&#xA0;cm</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>5</mml:mn><mml:mrow><mml:mtext>&#xA0;cm</mml:mtext></mml:mrow></mml:math></inline-formula> with particles of <inline-formula id="ieqn-2166"><mml:math id="mml-ieqn-2166"><mml:mn>0.5</mml:mn><mml:mrow><mml:mtext>&#xA0;cm</mml:mtext></mml:mrow></mml:math></inline-formula> in diameter, many orders of magnitude larger. See Remark <xref ref-type="statement" rid="st11_9">11.9</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p>
</statement><p>Other microstructure data such as the porosity and the coordination number (or number of contact points), being scalars that did not incorporate directional data like the fabric tensor, did not help to improve the accuracy of the network prediction, as noted in the caption of Figure <xref ref-type="fig" rid="fig-17">17</xref>.</p>
<p>To enforce objectivity of constitutive models realized as neural networks, (the history of) principal strains and incremental rotation parameters that describe the orientation of principal directions served as inputs to the network. Accordingly, principal stresses and incremental rotations for the principal directions were outputs of what was referred to as <italic>Spectral RNNs</italic> [<xref ref-type="bibr" rid="ref-25">25</xref>], which preserved objectivity of constitutive models.</p>
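<p>The role of principal values in enforcing objectivity can be illustrated with a short NumPy sketch. It only demonstrates the underlying property, namely that the principal strains are unchanged when the strain tensor is observed in a rotated frame; it does not reproduce the Spectral RNN of [<xref ref-type="bibr" rid="ref-25">25</xref>], which additionally used incremental rotation parameters for the principal directions. The function names and the sample strain values are hypothetical.</p>
<preformat>
```python
import numpy as np


def principal_strains(strain):
    """Spectral decomposition of a symmetric strain tensor (ascending eigenvalues)."""
    values, vectors = np.linalg.eigh(strain)
    return values, vectors


def rotation_about_z(angle):
    """Proper orthogonal matrix for a rotation about the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


# Hypothetical small-strain tensor (symmetric).
strain = np.array([[0.010, 0.002, 0.000],
                   [0.002, -0.004, 0.001],
                   [0.000, 0.001, 0.003]])

Q = rotation_about_z(0.7)
rotated = Q @ strain @ Q.T  # same strain state seen from a rotated frame

vals, vecs = principal_strains(strain)
vals_rot, vecs_rot = principal_strains(rotated)

# The principal strains agree; only the principal directions are rotated.
same_principal_values = bool(np.allclose(vals, vals_rot))
```
</preformat>
<p>A network fed with principal strains therefore receives the same scalar inputs regardless of the observer frame, while the frame dependence is carried entirely by the orientation of the principal directions.</p>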
<fig id="fig-107">
<label>Figure 107</label>
<caption><title><italic>Optimal RNN-LSTM architecture</italic> (Section <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>). 5 different configurations of RNNs with LSTM units [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Table reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-107.tif"/>
</fig>
</sec>
<sec id="s11_3_3"><label>11.3.3</label>
<title>Optimal RNN-LSTM architecture</title>
<p>Using the same discrete element assembly of microscale RVE in Figure <xref ref-type="fig" rid="fig-106">106</xref> to generate data, the authors of [<xref ref-type="bibr" rid="ref-25">25</xref>] tried out 5 different configurations of RNNs with LSTM units, with 2 or 3 hidden layers, 50 to 100 LSTM units (Figure <xref ref-type="fig" rid="fig-15">15</xref>, Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, and Figure <xref ref-type="fig" rid="fig-81">81</xref> for the detailed original LSTM cell), and either logistic sigmoid or ReLU as activation function, Figure <xref ref-type="fig" rid="fig-107">107</xref>.</p>
<fig id="fig-108">
<label>Figure 108</label>
<caption><title><italic>Optimal RNN-LSTM architecture</italic> (Section <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>). Training errors and test errors for 5 different configurations of RNNs with LSTM units, see Figure <xref ref-type="fig" rid="fig-107">107</xref> [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-108.tif"/>
</fig>
<p>Configuration 1 has 2 hidden layers with 50 LSTM units each, and with the logistic sigmoid as activation function. Config 2 is similar, but with 80 LSTM units per hidden layer. Config 3 is similar, but with 100 LSTM units per hidden layer. Config 4 is similar to Config 2, but with 3 hidden layers. Config 5 is similar to Config 4, but with ReLU activation function.</p>
<p>The training error and test error obtained with these 5 configurations are shown in Figure <xref ref-type="fig" rid="fig-108">108</xref>. The zoomed-in views of the training error and test error from epoch 3000 to epoch 5000 in Figure <xref ref-type="fig" rid="fig-109">109</xref> show that Config 5 was optimal, with smaller errors; in addition, ReLU would be computationally more efficient than the logistic sigmoid. Config 2 was, however, selected in [<xref ref-type="bibr" rid="ref-25">25</xref>], whose authors stated that the discrepancy was &#x201C;not significant&#x201D;, and that Config 2 gave &#x201C;good training and prediction performances&#x201D;.</p>
<statement id="st11_2"><title>Remark 11.2.</title>
<p>The above search for an optimal network architecture is similar to searching for an appropriate degree of a polynomial function for a best fit, avoiding overfit and underfit, over a given set of data points in least-squares curve fitting. See Figure <xref ref-type="fig" rid="fig-72">72</xref> in Section <xref ref-type="sec" rid="s6_5_9">6.5.9</xref> for an explanation of underfit and overfit, and Figure <xref ref-type="fig" rid="fig-99">99</xref> in Section <xref ref-type="sec" rid="s10">10</xref> for a similar search for an optimal network for numerical integration by ANN.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
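<p>The polynomial analogy in Remark 11.2 can be illustrated with a minimal sketch. The data set (noisy samples of a cubic), the alternating training/validation split, and the names <monospace>fit_errors</monospace> and <monospace>best_degree</monospace> are all hypothetical choices made for illustration; none of them comes from [<xref ref-type="bibr" rid="ref-25">25</xref>]. The degree with the smallest validation error plays the role of the selected configuration.</p>
<preformat preformat-type="code">
```python
import numpy as np


def fit_errors(x_train, y_train, x_val, y_val, degree):
    """Least-squares polynomial fit; return (training MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    val_mse = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    return train_mse, val_mse


# Hypothetical data: noisy samples of a cubic (not taken from the reviewed work).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = x**3 - 0.5 * x + 0.05 * rng.standard_normal(x.size)

# Alternate the points between a training set and a validation set.
x_train, y_train = x[0::2], y[0::2]
x_val, y_val = x[1::2], y[1::2]

# Candidate "architectures": polynomial degrees 1 to 9.
degrees = list(range(1, 10))
errors = {d: fit_errors(x_train, y_train, x_val, y_val, d) for d in degrees}

# The selected degree is the one with the smallest validation error.
best_degree = min(degrees, key=lambda d: errors[d][1])

for d in degrees:
    print(d, errors[d])
print("selected degree:", best_degree)
```
</preformat>
<p>In such a sketch, the training error can only decrease as the degree grows, whereas the validation error is large for too low a degree (underfit) and may grow again for too high a degree (overfit), which is why the selection is based on the validation error.</p>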
<statement id="st11_3"><title>Remark 11.3.</title>
<p>Referring to Remark <xref ref-type="statement" rid="st4_3">4.3</xref> and the neural network in Figure <xref ref-type="fig" rid="fig-14">14</xref> and to our definition of action depth as total number of action layers <inline-formula id="ieqn-2167"><mml:math id="mml-ieqn-2167"><mml:mi>L</mml:mi></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-23">23</xref> in Section <xref ref-type="sec" rid="s4_3_1">4.3.1</xref> and Remark <xref ref-type="statement" rid="st4_5">4.5</xref> in Section <xref ref-type="sec" rid="s4_6">4.6</xref>, it is clear that a network layer in [<xref ref-type="bibr" rid="ref-25">25</xref>] is a state layer, i.e., an input matrix <inline-formula id="ieqn-2168"><mml:math id="mml-ieqn-2168"><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, with <inline-formula id="ieqn-2169"><mml:math id="mml-ieqn-2169"><mml:mi>&#x2113;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi></mml:math></inline-formula>. 
Thus all configs in Figure <xref ref-type="fig" rid="fig-107">107</xref> with 2 hidden layers have an action depth of <inline-formula id="ieqn-2170"><mml:math id="mml-ieqn-2170"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> layers and a state depth of <inline-formula id="ieqn-2171"><mml:math id="mml-ieqn-2171"><mml:mi>L</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> layers, whereas Config 4 with 3 hidden layers has an action depth of <inline-formula id="ieqn-2172"><mml:math id="mml-ieqn-2172"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> layers and a state depth of <inline-formula id="ieqn-2173"><mml:math id="mml-ieqn-2173"><mml:mi>L</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula> layers. On the other hand, since RNNs with LSTM units were used, in view of Remark <xref ref-type="statement" rid="st7_2">7.2</xref>, these networks were equivalent to &#x201C;very deep feedforward networks&#x201D;.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p>
</statement> 
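<p>The layer counting in Remark 11.3 amounts to simple bookkeeping, sketched below. The table transcribes the textual description of Configs 1 to 5 given above (Config 3 is read as having 2 hidden layers, and Config 5 as having 3 hidden layers like Config 4); the class name <monospace>LSTMConfig</monospace> and its fields are hypothetical and serve only this illustration.</p>
<preformat preformat-type="code">
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LSTMConfig:
    """Hypothetical record of one RNN-LSTM configuration."""

    hidden_layers: int
    units_per_layer: int
    activation: str

    @property
    def action_depth(self) -> int:
        # L = number of hidden layers + 1 (the output layer is also an action layer).
        return self.hidden_layers + 1

    @property
    def state_depth(self) -> int:
        # States y^(0), ..., y^(L): L + 1 state layers in total.
        return self.action_depth + 1


# Configurations as described in the text (assumed reading, see lead-in).
CONFIGS = {
    1: LSTMConfig(2, 50, "sigmoid"),
    2: LSTMConfig(2, 80, "sigmoid"),
    3: LSTMConfig(2, 100, "sigmoid"),
    4: LSTMConfig(3, 80, "sigmoid"),
    5: LSTMConfig(3, 80, "relu"),
}

for number, cfg in CONFIGS.items():
    print(number, cfg.action_depth, cfg.state_depth)
```
</preformat>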
<statement id="st11_4"><title>Remark 11.4.</title>
<p>The same <italic>selected</italic> architecture of RNN with LSTM units on both the microscale RVE (Figure <xref ref-type="fig" rid="fig-106">106</xref>, Figure <xref ref-type="fig" rid="fig-110">110</xref>) and the mesoscale RVE (Figure <xref ref-type="fig" rid="fig-112">112</xref>, Figure <xref ref-type="fig" rid="fig-113">113</xref>) was used to produce the mesoscale RNN with LSTM units (&#x201C;Mesoscale data-driven constitutive model&#x201D;) and the macroscale RNN with LSTM units (&#x201C;Macroscale data-driven constitutive model&#x201D;), respectively [<xref ref-type="bibr" rid="ref-25">25</xref>].&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<fig id="fig-109">
<label>Figure 109</label>
<caption><title><italic>Optimal RNN-LSTM architecture</italic> (Section <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>). Training error (a) and testing error (b), close-up views of Figure <xref ref-type="fig" rid="fig-108">108</xref> from epoch 3000 to epoch 5000: Config 5 (purple line with dots) was optimal, with smaller errors than those of Config 2 (blue dashed line). See Figure <xref ref-type="fig" rid="fig-107">107</xref> for config details [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-109.tif"/>
</fig>
</sec>
<sec id="s11_3_4"><label>11.3.4</label>
<title>Dual-porosity dual-permeability governing equations</title>
<p>The governing equations for media with dual-porosity dual-permeability in this section apply only to macroscale (field-size) simulations, i.e., not to simulations with the microscale RVE (Figure <xref ref-type="fig" rid="fig-106">106</xref>, Figure <xref ref-type="fig" rid="fig-110">110</xref>) or with the mesoscale RVE (Figure <xref ref-type="fig" rid="fig-112">112</xref>, Figure <xref ref-type="fig" rid="fig-113">113</xref>).</p>
<p>For field-size simulations, assuming stationary conditions, small deformations, incompressibility, and no mass exchange between solid and fluid constituents, the problem is governed by the balance of linear momentum and the balances of fluid mass in the micropores and in the macropores. The displacement field of the solid <inline-formula id="ieqn-2174"><mml:math id="mml-ieqn-2174"><mml:mi>u</mml:mi></mml:math></inline-formula>, the micropore pressure <inline-formula id="ieqn-2175"><mml:math id="mml-ieqn-2175"><mml:msub><mml:mi>p</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:math></inline-formula>, and the macropore pressure <inline-formula id="ieqn-2176"><mml:math id="mml-ieqn-2176"><mml:msub><mml:mi>p</mml:mi><mml:mi>M</mml:mi></mml:msub></mml:math></inline-formula> constitute the primary unknowns of the problem. The (total) Cauchy stress tensor <inline-formula id="ieqn-2177"><mml:math id="mml-ieqn-2177"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is the sum of the effective stress tensor <inline-formula id="ieqn-2178"><mml:math id="mml-ieqn-2178"><mml:msup>  <mml:mi>&#x03C3;</mml:mi>  <mml:mo>&#x0027;</mml:mo> </mml:msup> </mml:math></inline-formula> on the solid skeleton and the contribution of the pore fluid pressure <inline-formula id="ieqn-2179"><mml:math id="mml-ieqn-2179"><mml:msub><mml:mi>p</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula>, which lies between <inline-formula id="ieqn-2180"><mml:math id="mml-ieqn-2180"><mml:msub><mml:mi>p</mml:mi><mml:mi>M</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-2181"><mml:math id="mml-ieqn-2181"><mml:msub><mml:mi>p</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:math></inline-formula> and was assumed to be a convex combination of these two pore pressures [<xref ref-type="bibr" rid="ref-25">25</xref>],</p>
<p><disp-formula id="eqn-405"><label>(405)</label><mml:math id="mml-eqn-405" display="block"><mml:mi>&#x03C3;</mml:mi><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msub><mml:mi>p</mml:mi><mml:mi>f</mml:mi> </mml:msub><mml:mi>I</mml:mi><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x21D2;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>&#x03C3;</mml:mi><mml:mtext>&#x2009;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:mi>&#x03C8;</mml:mi><mml:mtext>&#x2009;</mml:mtext><mml:msub><mml:mi>p</mml:mi><mml:mi>M</mml:mi></mml:msub><mml:mtext>&#x2009;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>&#x03C8;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x2009;</mml:mtext><mml:msub><mml:mi>p</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mtext>&#x2009;</mml:mtext><mml:mi>I</mml:mi><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2182"><mml:math id="mml-ieqn-2182"><mml:mi>&#x03C8;</mml:mi></mml:math></inline-formula> denoted the ratio of macropore volume over total pore volume, and <inline-formula id="ieqn-2183"><mml:math id="mml-ieqn-2183"><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>&#x03C8;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> the ratio of micropore volume over total pore volume.</p>
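<p>As a minimal numerical sketch of Eq. (<xref ref-type="disp-formula" rid="eqn-405">405</xref>), the function below evaluates the effective stress from the total stress and the two pore pressures. The function name <monospace>effective_stress</monospace>, the argument names, and the numerical values (in kPa, tension taken as positive) are hypothetical and chosen only for illustration.</p>
<preformat preformat-type="code">
```python
import numpy as np


def effective_stress(sigma, p_macro, p_micro, psi):
    """Eq. (405): sigma' = sigma + {psi p_M + (1 - psi) p_m} I."""
    if not (psi >= 0.0 and 1.0 >= psi):
        raise ValueError("psi must lie in [0, 1] for a convex combination")
    sigma = np.asarray(sigma, dtype=float)
    # Pore fluid pressure p_f as a convex combination of p_M and p_m.
    p_f = psi * p_macro + (1.0 - psi) * p_micro
    return sigma + p_f * np.eye(sigma.shape[0])


# Hypothetical total stress state (kPa), compression negative.
sigma_total = np.diag([-300.0, -200.0, -100.0])
sigma_eff = effective_stress(sigma_total, p_macro=50.0, p_micro=80.0, psi=0.4)
print(sigma_eff)
```
</preformat>
<p>With these assumed values, the pore fluid pressure is 0.4 &#x00D7; 50 + 0.6 &#x00D7; 80 = 68 kPa, which shifts each normal stress component by that amount and leaves the shear components unchanged.</p>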
<p>The <italic>balance of linear momentum</italic> equation is written as</p>
<p><disp-formula id="eqn-406"><label>(406)</label><mml:math id="mml-eqn-406" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>div&#xA0;</mml:mtext></mml:mrow><mml:mi mathvariant="bold-italic">&#x03C3;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2184"><mml:math id="mml-ieqn-2184"><mml:mi>&#x03C1;</mml:mi></mml:math></inline-formula> is the total mass density, <inline-formula id="ieqn-2185"><mml:math id="mml-ieqn-2185"><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:math></inline-formula> the acceleration of gravity,</p>
<p><disp-formula id="eqn-407"><label>(407)</label><mml:math id="mml-eqn-407" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>:=</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>v</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msub><mml:mrow><mml:mover><mml:mi>v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>:=</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>v</mml:mi><mml:mi>,</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2186"><mml:math id="mml-ieqn-2186"><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:math></inline-formula> is the velocity of the solid skeleton, <inline-formula id="ieqn-2187"><mml:math id="mml-ieqn-2187"><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-2188"><mml:math id="mml-ieqn-2188"><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are the fluid velocities in the macropores and in the micropores, respectively. In Eq. (<xref ref-type="disp-formula" rid="eqn-406">406</xref>), the coefficient <inline-formula id="ieqn-2189"><mml:math id="mml-ieqn-2189"><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> of fluid mass transfer between macropores and micropores was assumed to obey the &#x201C;semi-empirical&#x201D; relation [<xref ref-type="bibr" rid="ref-25">25</xref>]:</p>
<p><disp-formula id="eqn-408"><label>(408)</label><mml:math id="mml-eqn-408" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>M</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-2190"><mml:math id="mml-ieqn-2190"><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> being a parameter that characterized the interface permeability between macropores and micropores, and <inline-formula id="ieqn-2191"><mml:math id="mml-ieqn-2191"><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula> the dynamic viscosity of the fluid.</p>
<p>In the 1-D case, Darcy&#x2019;s law is written as</p>
<p><disp-formula id="eqn-409"><label>(409)</label><mml:math id="mml-eqn-409" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi>q</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mfrac><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mi>k</mml:mi><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mfrac><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2192"><mml:math id="mml-ieqn-2192"><mml:mi>q</mml:mi></mml:math></inline-formula> is the fluid mass flux (<inline-formula id="ieqn-2193"><mml:math id="mml-ieqn-2193"><mml:mrow><mml:mtext>kg/</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mtext>s</mml:mtext><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>), <inline-formula id="ieqn-2194"><mml:math id="mml-ieqn-2194"><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula> the fluid mass density (<inline-formula id="ieqn-2195"><mml:math id="mml-ieqn-2195"><mml:mrow><mml:mtext>kg/</mml:mtext></mml:mrow><mml:msup><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow><mml:mn>3</mml:mn></mml:msup></mml:math></inline-formula>), <inline-formula id="ieqn-2196"><mml:math id="mml-ieqn-2196"><mml:mi>v</mml:mi></mml:math></inline-formula> the fluid velocity (<inline-formula id="ieqn-2197"><mml:math id="mml-ieqn-2197"><mml:mrow><mml:mtext>m</mml:mtext><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mtext>s</mml:mtext></mml:mrow></mml:math></inline-formula>), <inline-formula id="ieqn-2198"><mml:math id="mml-ieqn-2198"><mml:mi>k</mml:mi></mml:math></inline-formula> the medium permeability (<inline-formula id="ieqn-2199"><mml:math id="mml-ieqn-2199"><mml:msup><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>), <inline-formula id="ieqn-2200"><mml:math id="mml-ieqn-2200"><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula> the fluid dynamic viscosity (<inline-formula id="ieqn-2201"><mml:math id="mml-ieqn-2201"><mml:mrow><mml:mtext>Pa</mml:mtext></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mtext>s</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>N&#x00A0;s</mml:mtext></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>), <inline-formula id="ieqn-2202"><mml:math id="mml-ieqn-2202"><mml:mi>p</mml:mi></mml:math></inline-formula> the pressure (<inline-formula id="ieqn-2203"><mml:math id="mml-ieqn-2203"><mml:mrow><mml:mtext>N</mml:mtext></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>), and <inline-formula id="ieqn-2204"><mml:math id="mml-ieqn-2204"><mml:mi>x</mml:mi></mml:math></inline-formula> the distance (m).</p>
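<p>The 1-D Darcy law Eq. (<xref ref-type="disp-formula" rid="eqn-409">409</xref>) can be evaluated directly, as in the following minimal sketch in SI units. The function names and the numerical values (a permeability of 10<sup>&#x2212;12</sup> m<sup>2</sup>, water-like viscosity and density, and a uniform pressure gradient) are hypothetical and serve only to illustrate the units listed above.</p>
<preformat preformat-type="code">
```python
def darcy_velocity(k, mu_f, dp_dx):
    """Eq. (409): v = -(k / mu_f) * dp/dx, in m/s."""
    return -(k / mu_f) * dp_dx


def mass_flux(rho_f, v):
    """Eq. (409): q = rho_f * v, in kg/(m^2 s)."""
    return rho_f * v


# Hypothetical values, SI units.
k = 1.0e-12      # medium permeability, m^2
mu_f = 1.0e-3    # fluid dynamic viscosity, Pa s
rho_f = 1000.0   # fluid mass density, kg/m^3
dp_dx = -1.0e4   # pressure gradient, Pa/m (pressure decreasing along x)

v = darcy_velocity(k, mu_f, dp_dx)
q = mass_flux(rho_f, v)
print(v, q)
```
</preformat>
<p>The negative sign in Eq. (<xref ref-type="disp-formula" rid="eqn-409">409</xref>) makes the fluid flow toward decreasing pressure: a negative pressure gradient gives a positive velocity in this example.</p>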
<fig id="fig-110">
<label>Figure 110</label>
<caption><title><italic>Mesoscale RNN with LSTM units. Traction-separation law</italic> (Sections <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>, <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>). <italic>Left:</italic> Sequence of imposed displacement jumps on microscale RVE (Figure <xref ref-type="fig" rid="fig-106">106</xref>), normal (solid line) and tangential (dotted line, with <inline-formula id="ieqn-2205"><mml:math id="mml-ieqn-2205"><mml:mrow><mml:msub><mml:mi>u</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2261;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>u</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-106">106</xref>, center). <italic>Right:</italic> Normal traction vs. normal displacement. Cyclic loading and unloading. Microscale RVE training data (blue) vs. Mesoscale RNN with LSTM prediction (red, Section <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>), with mean squared error 
<inline-formula id="ieqn-2206a"><mml:math id="mml-ieqn-2206a"><mml:mrow><mml:mn>3.73</mml:mn><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x00D7;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mrow><mml:mn>10</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>5</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. See also Figure <xref ref-type="fig" rid="fig-115">115</xref> on the macroscale RNN with LSTM units [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-110.tif"/>
</fig>
<statement id="st11_5"><title>Remark 11.5.</title>
<p><italic>Dimension of</italic> <inline-formula id="ieqn-2206"><mml:math id="mml-ieqn-2206"><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-408">408</xref>). Each term in the balance of linear momentum Eq. (<xref ref-type="disp-formula" rid="eqn-406">406</xref>) has force per unit volume (<inline-formula id="ieqn-2207"><mml:math id="mml-ieqn-2207"><mml:mi>F</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mn>3</mml:mn></mml:msup></mml:math></inline-formula>) as dimension, which is therefore the dimension of the right-hand side of Eq. (<xref ref-type="disp-formula" rid="eqn-406">406</xref>), where <inline-formula id="ieqn-2208"><mml:math id="mml-ieqn-2208"><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> appears. As a result, <inline-formula id="ieqn-2209"><mml:math id="mml-ieqn-2209"><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> has the dimension of mass (<inline-formula id="ieqn-2210"><mml:math id="mml-ieqn-2210"><mml:mi>M</mml:mi></mml:math></inline-formula>) per unit volume (<inline-formula id="ieqn-2211"><mml:math id="mml-ieqn-2211"><mml:msup><mml:mi>L</mml:mi><mml:mn>3</mml:mn></mml:msup></mml:math></inline-formula>) per unit time (<inline-formula id="ieqn-2212"><mml:math id="mml-ieqn-2212"><mml:mi>T</mml:mi></mml:math></inline-formula>):</p>
<p><disp-formula id="eqn-410"><label>(410)</label><mml:math id="mml-eqn-410" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mi>F</mml:mi><mml:msup><mml:mi>L</mml:mi><mml:mn>3</mml:mn></mml:msup></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:msup><mml:mi>T</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:msup><mml:mi>T</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mfrac><mml:mi>T</mml:mi><mml:mi>L</mml:mi></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mn>3</mml:mn></mml:msup><mml:mi>T</mml:mi></mml:mrow></mml:mfrac><mml:mtext mathcolor="block">.</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Another way to verify is to identify the right-hand side of Eq. (<xref ref-type="disp-formula" rid="eqn-406">406</xref>) with the usual inertia force per unit volume:</p>
<p><disp-formula id="eqn-411"><label>(411)</label><mml:math id="mml-eqn-411" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">v</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:mfrac><mml:mo 
stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mn>3</mml:mn></mml:msup><mml:mi>T</mml:mi></mml:mrow></mml:mfrac><mml:mtext mathcolor="block">.</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The empirical relation Eq. (<xref ref-type="disp-formula" rid="eqn-408">408</xref>) adopted in [<xref ref-type="bibr" rid="ref-25">25</xref>] implies that the dimension of <inline-formula id="ieqn-2213"><mml:math id="mml-ieqn-2213"><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> was</p>
<p><disp-formula id="eqn-412"><label>(412)</label><mml:math id="mml-eqn-412" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>M</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>M</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>L</mml:mi><mml:mn>3</mml:mn></mml:msup><mml:mi>T</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>F</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>F</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mi>M</mml:mi><mml:msup><mml:mi>L</mml:mi><mml:mn>3</mml:mn></mml:msup></mml:mfrac><mml:mtext mathcolor="block">,</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>i.e., mass density, whereas permeability has the dimension of area (<inline-formula id="ieqn-2214"><mml:math id="mml-ieqn-2214"><mml:msup><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>), as seen in Darcy&#x2019;s law Eq. (<xref ref-type="disp-formula" rid="eqn-409">409</xref>). It is not clear why it is written in [<xref ref-type="bibr" rid="ref-25">25</xref>] that <inline-formula id="ieqn-2215"><mml:math id="mml-ieqn-2215"><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> characterized the &#x201C;interface permeability between the macropores and the micropores&#x201D;. A reason could be that the &#x201C;semi-empirical&#x201D; relation Eq. (<xref ref-type="disp-formula" rid="eqn-408">408</xref>) was assumed to be analogous to Darcy&#x2019;s law Eq. (<xref ref-type="disp-formula" rid="eqn-409">409</xref>). Moreover, as a result of Eq. (<xref ref-type="disp-formula" rid="eqn-412">412</xref>), the dimension of <inline-formula id="ieqn-2216"><mml:math id="mml-ieqn-2216"><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is the same as that of the kinematic viscosity <inline-formula id="ieqn-2217"><mml:math id="mml-ieqn-2217"><mml:msub><mml:mi>&#x03BD;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C1;</mml:mi></mml:math></inline-formula>. If <inline-formula id="ieqn-2218"><mml:math id="mml-ieqn-2218"><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> had the dimension of permeability, then the right-hand side of the balance of linear momentum Eq. (<xref ref-type="disp-formula" rid="eqn-406">406</xref>) would be dimensionally inconsistent. See Remark <xref ref-type="statement" rid="st11_6">11.6</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
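<p>The dimensional argument of Remark 11.5 can be checked mechanically by representing each dimension as a triple of exponents of mass, length, and time. The following is a minimal sketch under that assumption; the helper names <monospace>mul</monospace> and <monospace>div</monospace> and the variable names are hypothetical.</p>
<preformat preformat-type="code">
```python
def mul(a, b):
    """Product of two dimensions: add the (M, L, T) exponents."""
    return tuple(x + y for x, y in zip(a, b))


def div(a, b):
    """Quotient of two dimensions: subtract the (M, L, T) exponents."""
    return tuple(x - y for x, y in zip(a, b))


MASS, LENGTH, TIME = (1, 0, 0), (0, 1, 0), (0, 0, 1)

velocity = div(LENGTH, TIME)                     # L / T
force = mul(MASS, div(velocity, TIME))           # M L / T^2
area = mul(LENGTH, LENGTH)                       # L^2, dimension of permeability
volume = mul(area, LENGTH)                       # L^3
density = div(MASS, volume)                      # M / L^3
pressure = div(force, area)                      # F / L^2
dynamic_viscosity = mul(pressure, TIME)          # F T / L^2

# Eq. (410): [c0] [v] = F / L^3.
c0 = div(div(force, volume), velocity)
# Eq. (411): [c0] = [rho] / T.
c0_from_inertia = div(density, TIME)
# Eq. (412): [alpha_bar] = [c0] [mu_f] / [p].
alpha_bar = div(mul(c0, dynamic_viscosity), pressure)
# Dimension of mu_f / alpha_bar, to be compared with kinematic viscosity L^2 / T.
mu_over_alpha = div(dynamic_viscosity, alpha_bar)

print(c0, c0_from_inertia, alpha_bar, mu_over_alpha)
```
</preformat>
<p>The exponent triples confirm that both routes give the same dimension for the coefficient of fluid mass transfer, and that the resulting dimension of the interface parameter is that of a mass density, not that of an area.</p>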
<fig id="fig-111">
<label>Figure 111</label>
<caption><title><italic>Continuum with embedded strong discontinuity</italic> (Section <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>). Domain <inline-formula id="ieqn-1965"><mml:math id="mml-ieqn-1965"><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mo>+</mml:mo></mml:mrow></mml:msup><mml:mo>&#x222A;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> with embedded discontinuity surface <inline-formula id="ieqn-1966"><mml:math id="mml-ieqn-1966"><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:math></inline-formula>, running through the middle of a narrow band (light blue) <inline-formula id="ieqn-1967"><mml:math id="mml-ieqn-1967"><mml:msub><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mo>+</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x222A;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2282;</mml:mo><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow></mml:math></inline-formula> between the parallel surfaces <inline-formula id="ieqn-1968"><mml:math id="mml-ieqn-1968"><mml:msup><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mrow><mml:mo>+</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-1969"><mml:math id="mml-ieqn-1969"><mml:msup><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. 
Objects behind <inline-formula id="ieqn-1970"><mml:math id="mml-ieqn-1970"><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:math></inline-formula> in the negative direction of the normal <inline-formula id="ieqn-1971"><mml:math id="mml-ieqn-1971"><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:math></inline-formula> to <inline-formula id="ieqn-1972"><mml:math id="mml-ieqn-1972"><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:math></inline-formula> are designated with the minus sign, and those in front of <inline-formula id="ieqn-1973"><mml:math id="mml-ieqn-1973"><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:math></inline-formula> with the plus sign. The narrow band <inline-formula id="ieqn-1974"><mml:math id="mml-ieqn-1974"><mml:msub><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents an embedded strong discontinuity, as shown in the mesoscale RVE in Figure <xref ref-type="fig" rid="fig-113">113</xref>, where the discretized strong discontinuity zones were a network of straight narrow bands. <italic>Top right:</italic> No sliding, interpretation of <inline-formula id="ieqn-1975"><mml:math id="mml-ieqn-1975"><mml:msub><mml:mi>u</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo>+</mml:mo><mml:mo>&#x301A;</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo>&#x301B;</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-25">25</xref>]. 
<italic>Bottom right:</italic> Sliding, interpretation of <inline-formula id="ieqn-1976"><mml:math id="mml-ieqn-1976"><mml:mi>u</mml:mi><mml:mtext>&#x2009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mrow><mml:mo>&#x301A;</mml:mo><mml:mi>u</mml:mi><mml:mo>&#x301B;</mml:mo></mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-297">297</xref>].</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-111.tif"/>
</fig>
<p>The (local) porosity <inline-formula id="ieqn-2219"><mml:math id="mml-ieqn-2219"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula> is the ratio of the void volume <inline-formula id="ieqn-2220"><mml:math id="mml-ieqn-2220"><mml:mi>d</mml:mi><mml:msub><mml:mi>V</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:math></inline-formula> over the total volume <inline-formula id="ieqn-2221"><mml:math id="mml-ieqn-2221"><mml:mi>d</mml:mi><mml:mi>V</mml:mi></mml:math></inline-formula>. Within the void volume, let <inline-formula id="ieqn-2222"><mml:math id="mml-ieqn-2222"><mml:mi>&#x03C8;</mml:mi></mml:math></inline-formula> be the percentage of macropores, and <inline-formula id="ieqn-2223"><mml:math id="mml-ieqn-2223"><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the percentage of micropores; we have:</p>
<p><disp-formula id="eqn-413"><label>(413)</label><mml:math id="mml-eqn-413" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03D5;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi>V</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:mi>&#x03C8;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi>V</mml:mi><mml:mi>M</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi>V</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C8;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi>V</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi>V</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mtext mathcolor="block">.</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The absolute macropore flux <inline-formula id="ieqn-2224"><mml:math id="mml-ieqn-2224"><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the absolute micropore flux <inline-formula id="ieqn-2225"><mml:math id="mml-ieqn-2225"><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are defined as follows:</p>
<p><disp-formula id="eqn-414"><label>(414)</label><mml:math id="mml-eqn-414" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>&#x03C8;</mml:mi><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mtext mathcolor="block">,</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
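<p>As a minimal numerical sketch of Eqs. (<xref ref-type="disp-formula" rid="eqn-413">413</xref>)-(<xref ref-type="disp-formula" rid="eqn-414">414</xref>), the Python fragment below evaluates the partial densities and the absolute mass fluxes for one material point. All numbers are illustrative assumptions and are not taken from [<xref ref-type="bibr" rid="ref-25">25</xref>]: the fluid is assumed to be water, the total porosity is set to 0.31 with a macroporosity of 0.114 (interpreted here as the macropore volume over the total volume), and the two velocities are arbitrary. The function names are hypothetical.</p>
<code language="python">
```python
# Illustrative values only (assumptions, not data from the cited work).
RHO_F = 1000.0        # fluid mass density rho_f [kg/m^3], assumed to be water
PHI = 0.31            # total porosity phi = dV_v / dV (assumed)
PSI = 0.114 / 0.31    # macropore fraction psi = dV_M / dV_v (assumed)


def partial_densities(rho_f, phi, psi):
    """Return (e_M, e_m) as defined in Eq. (414)."""
    e_macro = rho_f * phi * psi
    e_micro = rho_f * phi * (1.0 - psi)
    return e_macro, e_micro


def mass_flux(e, velocity):
    """Return q = e * v for a velocity given as a tuple of components."""
    return tuple(e * v_i for v_i in velocity)


e_M, e_m = partial_densities(RHO_F, PHI, PSI)
v_M = (2.0e-6, 0.0, 0.0)   # assumed macropore fluid velocity [m/s]
v_m = (1.0e-8, 0.0, 0.0)   # assumed micropore fluid velocity [m/s]
q_M = mass_flux(e_M, v_M)  # absolute macropore mass flux [kg/(m^2 s)]
q_m = mass_flux(e_m, v_m)  # absolute micropore mass flux [kg/(m^2 s)]
```
</code>
<p>With these assumed inputs, the two partial densities add up to the fluid mass per unit total volume, i.e., the fluid density times the porosity, which is a quick sanity check on the definitions.</p>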
<p>There is a conservation of fluid transfer between the macropores and the micropores across any closed surface <inline-formula id="ieqn-2226"><mml:math id="mml-ieqn-2226"><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:math></inline-formula>, i.e.,</p>
<p><disp-formula id="eqn-415"><label>(415)</label><mml:math id="mml-eqn-415" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mo>&#x222B;</mml:mo><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">n</mml:mi><mml:mi>d</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mtext mathcolor="block">.</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Generalizing Eq. (<xref ref-type="disp-formula" rid="eqn-409">409</xref>) to 3-D, Darcy&#x2019;s law in tensor form, which governs the fluid mass fluxes <inline-formula id="ieqn-2227"><mml:math id="mml-ieqn-2227"><mml:msub><mml:mi>q</mml:mi><mml:mi>M</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2228"><mml:math id="mml-ieqn-2228"><mml:msub><mml:mi>q</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:math></inline-formula>, is written as</p>
<p><disp-formula id="eqn-416"><label>(416)</label><mml:math id="mml-eqn-416" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mfrac><mml:msub><mml:mi mathvariant="bold-italic">k</mml:mi><mml:mi>M</mml:mi></mml:msub><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mfrac><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mi>M</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mfrac><mml:msub><mml:mi mathvariant="bold-italic">k</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mfrac><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2229"><mml:math id="mml-ieqn-2229"><mml:msub><mml:mi>k</mml:mi><mml:mi>M</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2230"><mml:math id="mml-ieqn-2230"><mml:msub><mml:mi>k</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:math></inline-formula> denote the permeability tensors on the respective scales and <inline-formula id="ieqn-2231"><mml:math id="mml-ieqn-2231"><mml:mi>g</mml:mi></mml:math></inline-formula> is the gravitational acceleration.</p>
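<p>The tensor form of Darcy&#x2019;s law in Eq. (<xref ref-type="disp-formula" rid="eqn-416">416</xref>) can be sketched in a few lines of Python. The fragment below is only an illustration under stated assumptions: the permeability is taken as isotropic, the fluid properties are those of water, the vertical axis points upward, and the helper names <monospace>mat_vec</monospace> and <monospace>darcy_mass_flux</monospace> are hypothetical. It shows that a hydrostatic pressure gradient produces no relative flux, whereas an additional horizontal pressure gradient drives flow down that gradient.</p>
<code language="python">
```python
def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix (nested tuples) by a 3-vector."""
    return tuple(sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix)


def darcy_mass_flux(k, mu_f, rho_f, grad_p, g):
    """Relative fluid mass flux: q~ = -rho_f (k / mu_f) . (grad p - rho_f g)."""
    driving = tuple(dp_i - rho_f * g_i for dp_i, g_i in zip(grad_p, g))
    k_driving = mat_vec(k, driving)
    return tuple(-rho_f * kd_i / mu_f for kd_i in k_driving)


RHO_F = 1000.0                      # assumed fluid density [kg/m^3]
MU_F = 1.0e-3                       # assumed fluid viscosity [Pa s]
G = (0.0, 0.0, -9.81)               # gravity [m/s^2], z axis pointing up
K_MACRO = ((1.0e-12, 0.0, 0.0),
           (0.0, 1.0e-12, 0.0),
           (0.0, 0.0, 1.0e-12))     # assumed isotropic permeability [m^2]

# Hydrostatic state: grad p = rho_f g, so no relative flow is expected.
grad_p_hydrostatic = tuple(RHO_F * g_i for g_i in G)
q_rest = darcy_mass_flux(K_MACRO, MU_F, RHO_F, grad_p_hydrostatic, G)

# An extra horizontal pressure gradient of -1000 Pa/m drives flow in +x.
grad_p_driven = (-1000.0, 0.0, RHO_F * G[2])
q_driven = darcy_mass_flux(K_MACRO, MU_F, RHO_F, grad_p_driven, G)
```
</code>
<p>The same function would serve for the micropore scale by passing a different (assumed) permeability tensor and the micropore pressure gradient.</p>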
<p>From Eq. (<xref ref-type="disp-formula" rid="eqn-415">415</xref>), assuming that</p>
<p><disp-formula id="eqn-417"><label>(417)</label><mml:math id="mml-eqn-417" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2232"><mml:math id="mml-ieqn-2232"><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> is given in Eq. (<xref ref-type="disp-formula" rid="eqn-408">408</xref>), then two more governing equations are obtained:</p>
<p><disp-formula id="eqn-418"><label>(418)</label><mml:math id="mml-eqn-418" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-419"><label>(419)</label><mml:math id="mml-eqn-419" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>agreeing with [<xref ref-type="bibr" rid="ref-25">25</xref>].</p> 
<statement id="st11_6"><title>Remark 11.6.</title>
<p>It can be verified from Eq. (<xref ref-type="disp-formula" rid="eqn-417">417</xref>) that the dimension of <inline-formula id="ieqn-2233"><mml:math id="mml-ieqn-2233"><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> is</p>
<p><disp-formula id="eqn-420"><label>(420)</label><mml:math id="mml-eqn-420" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mi>L</mml:mi></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mi>L</mml:mi></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mn>3</mml:mn></mml:msup><mml:mi>T</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>agreeing with Eq. (<xref ref-type="disp-formula" rid="eqn-411">411</xref>). In view of Remark <xref ref-type="statement" rid="st11_5">11.5</xref>, for all three governing equations, Eq. (<xref ref-type="disp-formula" rid="eqn-406">406</xref>) and <xref ref-type="disp-formula" rid="eqn-418">Eqs. (418)</xref>-<xref ref-type="disp-formula" rid="eqn-419">(419)</xref>, to be dimensionally consistent, <inline-formula id="ieqn-2234"><mml:math id="mml-ieqn-2234"><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-408">408</xref>) should have the same dimension as mass density, as indicated in Eq. (<xref ref-type="disp-formula" rid="eqn-412">412</xref>). Our consistent notation for the fluxes <inline-formula id="ieqn-2235"><mml:math id="mml-ieqn-2235"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-2236"><mml:math id="mml-ieqn-2236"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">q</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> differs in meaning from that in [<xref ref-type="bibr" rid="ref-25">25</xref>].&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
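<p>The dimensional bookkeeping in Remark <xref ref-type="statement" rid="st11_6">11.6</xref> can be mechanized by tracking the exponents of mass, length, and time. The short Python sketch below is one possible way to do so; the type <monospace>Dim</monospace> and the helpers <monospace>times</monospace> and <monospace>over</monospace> are hypothetical names introduced only for this illustration.</p>
<code language="python">
```python
from collections import namedtuple

# Exponents of mass (M), length (L), and time (T).
Dim = namedtuple("Dim", ["M", "L", "T"])


def times(a, b):
    """Dimension of a product: exponents add."""
    return Dim(a.M + b.M, a.L + b.L, a.T + b.T)


def over(a, b):
    """Dimension of a quotient: exponents subtract."""
    return Dim(a.M - b.M, a.L - b.L, a.T - b.T)


LENGTH = Dim(0, 1, 0)
DENSITY = Dim(1, -3, 0)        # [rho] = M / L^3
VELOCITY = Dim(0, 1, -1)       # [v] = L / T

flux = times(DENSITY, VELOCITY)    # [q_M] = [rho][v] = M / (L^2 T)
div_flux = over(flux, LENGTH)      # [div q_M] = [q_M] / L = M / (L^3 T)
```
</code>
<p>The resulting exponents (1, -3, -1) correspond to mass per unit volume per unit time, as stated in Eq. (<xref ref-type="disp-formula" rid="eqn-420">420</xref>).</p>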
<fig id="fig-112">
<label>Figure 112</label>
<caption><title><italic>Mesoscale RVE</italic> (Sections <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>, <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>). A 2-D domain of size <inline-formula id="ieqn-1977"><mml:math id="mml-ieqn-1977"><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#xA0;m</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#xA0;m</mml:mtext></mml:mrow></mml:math></inline-formula> (Remark <xref ref-type="statement" rid="st11_9">11.9</xref>). See Figure <xref ref-type="fig" rid="fig-14">14</xref> (row 2, left), and also Figure <xref ref-type="fig" rid="fig-111">111</xref> for a conceptual representation. Embedded strong discontinuity zones where damage occurred formed a network of straight narrow bands surrounded by elastic material. Both the strong discontinuity narrow bands and the elastic domain were discretized into finite elements. Imposed displacements <inline-formula id="ieqn-1978"><mml:math id="mml-ieqn-1978"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mi>M</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-1979"><mml:math id="mml-ieqn-1979"><mml:msub><mml:mi>u</mml:mi><mml:mi>S</mml:mi></mml:msub><mml:mo>&#x2261;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mi>N</mml:mi></mml:msub></mml:math></inline-formula>, at the top (center). See Figure <xref ref-type="fig" rid="fig-113">113</xref> for the deformation (strains and displacement jumps) [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-112.tif"/>
</fig>
<statement id="st11_7"><title>Remark 11.7.</title>
<p>For field-size simulations, the above equations do not include the changing size of the pores, which were assumed to be of constant size, and thus constant porosity, in [<xref ref-type="bibr" rid="ref-25">25</xref>]. As a result, the collapse of the pores that leads to nonlinearity in the stress-strain relation observed in experiments (Figure <xref ref-type="fig" rid="fig-13">13</xref>) is not modelled in [<xref ref-type="bibr" rid="ref-25">25</xref>], where the nonlinearity essentially came from the embedded strong discontinuities (displacement jumps) and the associated traction-separation law obtained from DEM simulations using the micro RVE in Figure <xref ref-type="fig" rid="fig-106">106</xref> to train the meso RNN with LSTM; see Section <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>. See also Remark <xref ref-type="statement" rid="st11_10">11.10</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p>
</statement> </sec>
<sec id="s11_3_5"><label>11.3.5</label>
<title>Embedded strong discontinuities, traction-separation law</title>
<p>Strong discontinuities are embedded at both the mesoscale and the macroscale. Once a fault has formed through cracks in rocks, it can become inactive (no further slip) due to surrounding stresses, friction between the two surfaces of the fault, cohesive bonding, and low fluid pore pressure. A fault can be reactivated (onset of renewed fault slip) due to a changing stress state, loosened fault cohesion, or high fluid pore pressure. Conventional models for fault reactivation are based on effective stresses and the Coulomb law [<xref ref-type="bibr" rid="ref-298">298</xref>]:
<disp-formula id="eqn-421"><label>(421)</label><mml:math id="mml-eqn-421" display="block"><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2265;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>C</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>&#x03BC;</mml:mi><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>C</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>p</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2237"><mml:math id="mml-ieqn-2237"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> is the shear stress along the fault line, <inline-formula id="ieqn-2238"><mml:math id="mml-ieqn-2238"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:math></inline-formula> the critical shear stress for the onset of fault reactivation, <inline-formula id="ieqn-2239"><mml:math id="mml-ieqn-2239"><mml:mi>C</mml:mi></mml:math></inline-formula> the cohesion strength, <inline-formula id="ieqn-2240"><mml:math id="mml-ieqn-2240"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula> the coefficient of friction, <inline-formula id="ieqn-2241"><mml:math id="mml-ieqn-2241"><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup></mml:math></inline-formula> the effective stress normal to the fault line, <inline-formula id="ieqn-2242"><mml:math id="mml-ieqn-2242"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> the normal stress, and <inline-formula id="ieqn-2243"><mml:math id="mml-ieqn-2243"><mml:mi>p</mml:mi></mml:math></inline-formula> the fluid pore pressure. The authors of [<xref ref-type="bibr" rid="ref-299">299</xref>] demonstrated that an increase in the fluid injection rate led to an increase in the peak fluid pressure and, as a result, to fault reactivation, as part of a study on why there was an exponential increase in seismic activity in Oklahoma due to wastewater injection for use in hydraulic fracturing (i.e., fracking) [<xref ref-type="bibr" rid="ref-300">300</xref>].</p>
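<p>The role of the pore pressure in the Coulomb criterion, Eq. (<xref ref-type="disp-formula" rid="eqn-421">421</xref>), can be illustrated with a small Python sketch. The stress state, cohesion, and friction coefficient below are assumed values chosen only for illustration (compression taken as positive, stresses in MPa), and the function names are hypothetical. Raising the pore pressure lowers the effective normal stress, and hence the critical shear stress, until the criterion for reactivation is met.</p>
<code language="python">
```python
def critical_shear_stress(cohesion, friction, sigma_n, pore_pressure):
    """tau_p = C + mu * (sigma - p), with sigma - p the effective normal stress."""
    return cohesion + friction * (sigma_n - pore_pressure)


def is_reactivated(tau, cohesion, friction, sigma_n, pore_pressure):
    """Return True when tau >= tau_p, i.e., at the onset of fault slip."""
    return tau >= critical_shear_stress(cohesion, friction, sigma_n, pore_pressure)


# Assumed illustrative stress state (compression positive, units of MPa).
TAU = 20.0        # shear stress along the fault
SIGMA_N = 50.0    # total normal stress
COHESION = 1.0    # cohesion strength C
FRICTION = 0.6    # coefficient of friction mu

# Sweep the pore pressure and report the critical shear stress and the verdict.
for p in (10.0, 15.0, 20.0, 25.0):
    tau_p = critical_shear_stress(COHESION, FRICTION, SIGMA_N, p)
    print(p, tau_p, is_reactivated(TAU, COHESION, FRICTION, SIGMA_N, p))
```
</code>
<p>For these assumed numbers the fault is stable at the lower pore pressures and reactivated at the higher ones. The check returns only a yes or no verdict; it says nothing about the amount of slip.</p>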
<fig id="fig-113">
<label>Figure 113</label>
<caption><title><italic>Mesoscale RVE</italic> (Section <xref ref-type="sec" rid="s11_3_3">11.3.3</xref>). Strains and displacement jumps [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-113.tif"/>
</fig>
<p>The criterion in Eq. (<xref ref-type="disp-formula" rid="eqn-421">421</xref>), however, involves only stresses, with no displacement, and thus cannot be used to quantify the amount of fault slip. To allow for quantitative modeling of fault slip, or displacement jump, in a displacement-driven FEM environment, the so-called &#x201C;cohesive traction-separation laws,&#x201D; which express the traction (stress) vector on the fault surface as a function of fault slip, similar to those used in modeling cohesive zones in nonlinear fracture mechanics [<xref ref-type="bibr" rid="ref-301">301</xref>], are needed. These classical &#x201C;cohesive traction-separation laws,&#x201D; however, are not appropriate for handling loading-unloading cycles.</p>
<p>To model a continuum with displacement jumps, i.e., embedded strong discontinuities, the traction-separation law was represented in [<xref ref-type="bibr" rid="ref-25">25</xref>] as</p>
<p><disp-formula id="eqn-422"><label>(422)</label><mml:math id="mml-eqn-422" display="block"><mml:mrow><mml:mi>T</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2022;</mml:mo><mml:mi>n</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2244"><mml:math id="mml-ieqn-2244"><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:math></inline-formula> is the traction vector on the fault surface, <inline-formula id="ieqn-2245"><mml:math id="mml-ieqn-2245"><mml:mi>u</mml:mi></mml:math></inline-formula> the displacement field, <inline-formula id="ieqn-2246"><mml:math id="mml-ieqn-2246"><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> the jump operator, making <inline-formula id="ieqn-2247"><mml:math id="mml-ieqn-2247"><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> the displacement jump (fault slip, separation), <inline-formula id="ieqn-2248"><mml:math id="mml-ieqn-2248"><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup></mml:math></inline-formula> the effective stress tensor at the fault surface, and <inline-formula id="ieqn-2249"><mml:math id="mml-ieqn-2249"><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:math></inline-formula> the normal to the fault surface. 
To obtain the traction-separation law represented by the function <inline-formula id="ieqn-2250"><mml:math id="mml-ieqn-2250"><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, a neural network can be used provided that data are available for training and testing.</p>
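<p>The right-hand side of Eq. (<xref ref-type="disp-formula" rid="eqn-422">422</xref>), i.e., the projection of the effective stress tensor onto the fault normal, together with its split into the normal and tangential components of the kind plotted in Figure <xref ref-type="fig" rid="fig-114">114</xref>, can be sketched as follows. This is not the neural-network model of [<xref ref-type="bibr" rid="ref-25">25</xref>]; it only illustrates the kinematic post-processing for one assumed 2-D stress state, with hypothetical function names and an assumed orientation of the tangent.</p>
<code language="python">
```python
def traction(sigma_eff, normal):
    """Traction vector T = sigma' . n for a 2x2 stress tensor and a unit normal."""
    return tuple(sum(s_ij * n_j for s_ij, n_j in zip(row, normal)) for row in sigma_eff)


def split_traction(t_vec, normal):
    """Return (T_n, T_s): normal and tangential components of the traction."""
    tangent = (normal[1], -normal[0])          # unit tangent, n rotated by -90 degrees
    t_n = sum(t_i * n_i for t_i, n_i in zip(t_vec, normal))
    t_s = sum(t_i * s_i for t_i, s_i in zip(t_vec, tangent))
    return t_n, t_s


# Assumed 2-D effective stress state [MPa] and a fault with its normal along y.
SIGMA_EFF = ((-10.0, 3.0),
             (3.0, -20.0))
NORMAL = (0.0, 1.0)

T = traction(SIGMA_EFF, NORMAL)          # traction vector on the fault surface
T_n, T_s = split_traction(T, NORMAL)     # normal and tangential tractions
```
</code>
<p>In a data-driven setting, pairs of displacement jump and traction components such as these would form the training samples, while the mapping between them would be supplied by the trained network rather than by a closed-form expression.</p>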
<p>The authors of [<xref ref-type="bibr" rid="ref-25">25</xref>] assumed that cracks were pre-existing and did not propagate (see the mesoscale RVE in Figure <xref ref-type="fig" rid="fig-113">113</xref>), and then used the microscale RVE in Figure <xref ref-type="fig" rid="fig-106">106</xref> to generate training data and test data for the mesoscale RNN with LSTM units, called the &#x201C;Mesoscale data-driven constitutive model&#x201D;, to represent the traction-separation law for porous media. Their results are shown in Figure <xref ref-type="fig" rid="fig-110">110</xref>.</p>
<statement id="st11_8"><title>Remark 11.8.</title>
<p>The microscale RVE in Figure <xref ref-type="fig" rid="fig-106">106</xref> did not represent any real-world porous rock sample such as the Majella limestone with macroporosity <inline-formula id="ieqn-2251"><mml:math id="mml-ieqn-2251"><mml:mn>11.4</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> and microporosity <inline-formula id="ieqn-2252"><mml:math id="mml-ieqn-2252"><mml:mn>19.6</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mn>0.2</mml:mn></mml:math></inline-formula> shown in Figure <xref ref-type="fig" rid="fig-12">12</xref> in Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, but was a rather simple assembly of mono-disperse (identical) solid spheres with no information on size and no clear contact force-displacement relation; see [<xref ref-type="bibr" rid="ref-302">302</xref>]. Another problem was that a realistic porosity of 0.1 or 0.2 could not be achieved with this microscale RVE, which yielded a porosity above 0.3, the level of the total porosity (= macroporosity + microporosity) of the highly porous Majella limestone. The goal of [<xref ref-type="bibr" rid="ref-25">25</xref>] was only to demonstrate the methodology, not to present realistic results.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p>
</statement>
<fig id="fig-114">
<label>Figure 114</label>
<caption><title><italic>Mesoscale RVE</italic> (Section <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>). Validation of coupled FEM and RNN with LSTM units (FEM-LSTM, red dotted line) against coupled FEM and DEM (FEM-DEM, blue line) to analyze the mesoscale RVE in Figure <xref ref-type="fig" rid="fig-113">113</xref> under a sequence of imposed displacement jumps at the top (represented by numbers). (a) Normal traction (Tn) vs normal displacement (Un). (b) Tangential traction (Ts) vs tangential displacement (Us) [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-114.tif"/>
</fig>
<statement id="st11_9"><title>Remark 11.9.</title>
<p>Even though the mesoscale RVE was indicated to be of centimeter size in Figure <xref ref-type="fig" rid="fig-14">14</xref> (row 2) in Section <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, the mesoscale RVE in Figure <xref ref-type="fig" rid="fig-113">113</xref> was of size <inline-formula id="ieqn-2253"><mml:math id="mml-ieqn-2253"><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#xA0;m</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#xA0;m</mml:mtext></mml:mrow></mml:math></inline-formula>, about two orders of magnitude larger in linear dimension. See Remark <xref ref-type="statement" rid="st11_1">11.1</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>To analyze the mesoscale RVE in Figure <xref ref-type="fig" rid="fig-113">113</xref> (Figure <xref ref-type="fig" rid="fig-104">104</xref>, center) and the macroscale (field-size) model (Figure <xref ref-type="fig" rid="fig-104">104</xref>, right) by finite elements, both with embedded strong discontinuities (Figure <xref ref-type="fig" rid="fig-111">111</xref>), the authors of [<xref ref-type="bibr" rid="ref-25">25</xref>] adopted a formulation that looked similar to [<xref ref-type="bibr" rid="ref-297">297</xref>] to represent strong discontinuities, which could result from fractures or shear bands, by the <italic>local</italic> displacement field <inline-formula id="ieqn-2254"><mml:math id="mml-ieqn-2254"><mml:msub><mml:mi>u</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula> as<xref ref-type="fn" rid="fn273"><sup>273</sup></xref><fn id="fn273"><label>273</label><p>The subscript <inline-formula id="ieqn-3314"><mml:math id="mml-ieqn-3314"><mml:mi>&#x03BC;</mml:mi></mml:math></inline-formula> in <inline-formula id="ieqn-3315"><mml:math id="mml-ieqn-3315"><mml:msub><mml:mi mathvariant='bold-italic'>u</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula> probably meant &#x201C;micro&#x201D;, and was used to designate the <italic>local</italic> nature of <inline-formula id="ieqn-3316"><mml:math id="mml-ieqn-3316"><mml:msub><mml:mi mathvariant='bold-italic'>u</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula>.</p></fn></p>
<p><disp-formula id="eqn-423"><label>(423)</label><mml:math id="mml-eqn-423" display="block"><mml:msub><mml:mi>u</mml:mi><mml:mi>&#x00B5;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>u</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mtext>&#x0393;</mml:mtext></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>f</mml:mi><mml:mtext>&#x0393;</mml:mtext></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mtext>&#x00A0;</mml:mtext><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>u</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mtext>&#x0393;</mml:mtext></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x21D2;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>u</mml:mi><mml:mi>&#x00B5;</mml:mi></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mover 
accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mtext>&#x00A0;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mtext>&#x0393;</mml:mtext></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>which differs from the global smooth displacement field <inline-formula id="ieqn-2255"><mml:math id="mml-ieqn-2255"><mml:mi>u</mml:mi></mml:math></inline-formula> by the displacement jump vector <inline-formula id="ieqn-2256"><mml:math id="mml-ieqn-2256"><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> across the singular surface <inline-formula id="ieqn-2257"><mml:math id="mml-ieqn-2257"><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:math></inline-formula> that represents a discontinuity, multiplied by the function <inline-formula id="ieqn-2258"><mml:math id="mml-ieqn-2258"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-2259"><mml:math id="mml-ieqn-2259"><mml:msub><mml:mi>H</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub></mml:math></inline-formula> is the Heaviside function, such that <inline-formula id="ieqn-2260"><mml:math id="mml-ieqn-2260"><mml:msub><mml:mi>H</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> in <inline-formula id="ieqn-2261"><mml:math id="mml-ieqn-2261"><mml:msup><mml:mrow><mml:mi>&#x212C;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-2262"><mml:math id="mml-ieqn-2262"><mml:msub><mml:mi>H</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> in 
<inline-formula id="ieqn-2263"><mml:math id="mml-ieqn-2263"><mml:msup><mml:mrow><mml:mi>&#x212C;</mml:mi></mml:mrow><mml:mrow><mml:mo>+</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, and <inline-formula id="ieqn-2264"><mml:math id="mml-ieqn-2264"><mml:msub><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub></mml:math></inline-formula> a smooth ramp function equal to zero in <inline-formula id="ieqn-2265"><mml:math id="mml-ieqn-2265"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and going smoothly up to 1 in <inline-formula id="ieqn-2266"><mml:math id="mml-ieqn-2266"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mo>+</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mo>+</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, as defined in [<xref ref-type="bibr" rid="ref-297">297</xref>]. So Eq. 
(<xref ref-type="disp-formula" rid="eqn-423">423</xref>) means that the displacement field <inline-formula id="ieqn-2267"><mml:math id="mml-ieqn-2267"><mml:mi>u</mml:mi></mml:math></inline-formula> only had a smooth &#x201C;bump&#x201D; with support being the band <inline-formula id="ieqn-2268"><mml:math id="mml-ieqn-2268"><mml:msub><mml:mrow><mml:mi>&#x212C;</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as a result of a local displacement jump, with <italic>no</italic> fault sliding, as shown in the top right subfigure in Figure <xref ref-type="fig" rid="fig-111">111</xref>.</p>
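<p>The role of the enhancement function <inline-formula><mml:math><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-423">423</xref>) can be illustrated with a minimal one-dimensional sketch. The setting below is an assumption made only for illustration and is not taken from [<xref ref-type="bibr" rid="ref-25">25</xref>] or [<xref ref-type="bibr" rid="ref-297">297</xref>]: the discontinuity sits at x = 0, the band is the interval from &#x2212;h to h, the ramp is a cubic smoothstep, and the smooth field, the jump value, and all function names are hypothetical.</p>
<preformat><![CDATA[
```python
import numpy as np


def heaviside(x):
    """H_Gamma: 0 on the minus side (x < 0) and 1 on the plus side (x >= 0)."""
    return np.where(x >= 0.0, 1.0, 0.0)


def ramp(x, h):
    """f_Gamma: 0 for x <= -h, 1 for x >= h, cubic smoothstep in between (assumed shape)."""
    s = np.clip((x + h) / (2.0 * h), 0.0, 1.0)
    return s * s * (3.0 - 2.0 * s)


def local_displacement(u, jump, x, h):
    """u_mu = u + [[u]] (H_Gamma - f_Gamma), the first relation in Eq. (423)."""
    return u + jump * (heaviside(x) - ramp(x, h))


h = 0.1        # half-width of the band (illustrative value)
jump = 0.02    # constant displacement jump (illustrative value)
x = np.linspace(-1.0, 1.0, 2001)
u = 0.05 * x   # hypothetical smooth global displacement field

u_mu = local_displacement(u, jump, x, h)
bump = u_mu - u                      # enhancement term added to the smooth field
outside_band = np.abs(x) >= h

# The enhancement vanishes outside the band.
print("max |u_mu - u| outside the band:", np.max(np.abs(bump[outside_band])))

# Across the discontinuity, the local field jumps by the prescribed amount.
eps = 1e-9
left = local_displacement(0.0, jump, np.array(-eps), h)
right = local_displacement(0.0, jump, np.array(eps), h)
print("jump of u_mu across Gamma:", float(right - left))
```
]]></preformat>
<p>In this sketch, the difference between the local and the global field is confined to the band, while the local field itself still carries the full jump across the discontinuity.</p>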
<p>Eq. (<xref ref-type="disp-formula" rid="eqn-423">423</xref>)<inline-formula id="ieqn-2269"><mml:math id="mml-ieqn-2269"><mml:msub><mml:mi></mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> was the starting point in [<xref ref-type="bibr" rid="ref-297">297</xref>], but <italic>without</italic> using the definition of <inline-formula id="ieqn-2270"><mml:math id="mml-ieqn-2270"><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-423">423</xref>)<inline-formula id="ieqn-2271"><mml:math id="mml-ieqn-2271"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>:<xref ref-type="fn" rid="fn274"><sup>274</sup></xref><fn id="fn274"><label>274</label><p>[<xref ref-type="bibr" rid="ref-297">297</xref>], Eq. (2.1).</p></fn></p>
<p><disp-formula id="eqn-424"><label>(424)</label><mml:math id="mml-eqn-424" display="block"><mml:mi>u</mml:mi><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mtext>&#x00A0;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mtext>&#x0393;</mml:mtext></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x21D2;</mml:mo><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo>&#x00B7;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mover><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mo>&#x00B7;</mml:mo></mml:mover><mml:mo>+</mml:mo><mml:mrow><mml:mo>&#x301A;</mml:mo><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo>&#x00B7;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x301B;</mml:mo></mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>&#x0393;</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-2272"><mml:math id="mml-ieqn-2272"><mml:mi>u</mml:mi></mml:math></inline-formula> being the total displacement field (including the jump), <inline-formula id="ieqn-2273"><mml:math id="mml-ieqn-2273"><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> the smooth part of <inline-formula id="ieqn-2274"><mml:math id="mml-ieqn-2274"><mml:mi>u</mml:mi></mml:math></inline-formula>, and the overhead dot representing the rate or increment. As such, Eq. (<xref ref-type="disp-formula" rid="eqn-424">424</xref>) can describe the sliding between <inline-formula id="ieqn-2275"><mml:math id="mml-ieqn-2275"><mml:msup><mml:mrow><mml:mi>&#x212C;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-2276"><mml:math id="mml-ieqn-2276"><mml:msup><mml:mrow><mml:mi>&#x212C;</mml:mi></mml:mrow><mml:mrow><mml:mo>+</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, as shown in the bottom right subfigure in Figure <xref ref-type="fig" rid="fig-111">111</xref>. Assuming that the jump <inline-formula id="ieqn-2277"><mml:math id="mml-ieqn-2277"><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> has zero gradient in <inline-formula id="ieqn-2278"><mml:math id="mml-ieqn-2278"><mml:msup><mml:mrow><mml:mi>&#x212C;</mml:mi></mml:mrow><mml:mrow><mml:mo>+</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, take the gradient of the rate form in Eq. 
(<xref ref-type="disp-formula" rid="eqn-424">424</xref>)<inline-formula id="ieqn-2279"><mml:math id="mml-ieqn-2279"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> and symmetrize to obtain the small-strain rate<xref ref-type="fn" rid="fn275"><sup>275</sup></xref><fn id="fn275"><label>275</label><p>[<xref ref-type="bibr" rid="ref-297">297</xref>], Eq. (2.2).</p></fn></p>
<fig id="fig-115">
<label>Figure 115</label>
<caption><title><italic>Macroscale RNN with LSTM units</italic> (Section <xref ref-type="sec" rid="s11_3_5">11.3.5</xref>). Normal traction (Tn) vs imposed displacement jumps (Un) on mesoscale RVE (Figure <xref ref-type="fig" rid="fig-112">112</xref>, Figure <xref ref-type="fig" rid="fig-113">113</xref>). <italic>Blue:</italic> Training data (TR1-TR3) and test data (TE1-TE3) from the mesoscale FEM-LSTM model, where numbers indicate the sequence of loading-unloading steps similar to those in Figure <xref ref-type="fig" rid="fig-110">110</xref> for the mesoscale RNN with LSTM units. <italic>Red:</italic> Corresponding predictions of the trained macroscale RNN with LSTM units. The mean squared error (MSE) was used as loss function [<xref ref-type="bibr" rid="ref-25">25</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-115.tif"/>
</fig>
<p><disp-formula id="eqn-425"><label>(425)</label><mml:math id="mml-eqn-425" display="block"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2207;</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>0</mml:mn><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext><mml:mo>&#x2207;</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mtext>&#x0393;</mml:mtext></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>n</mml:mi><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mtext>&#x0393;</mml:mtext></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x21D2;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mover accent='true'><mml:mi>&#x03B5;</mml:mi><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x00A0;sym</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2207;</mml:mo><mml:mover accent='true'><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x00A0;sym</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mo>&#x301A;</mml:mo><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>&#x301B;</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2297;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>n</mml:mi><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mtext>&#x0393;</mml:mtext></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>,</mml:mo><mml:mtext>&#x00A0;with&#x00A0;sym</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi>b</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mi>b</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Later in [<xref ref-type="bibr" rid="ref-297">297</xref>], an equation that looked similar to Eq. (<xref ref-type="disp-formula" rid="eqn-423">423</xref>)<inline-formula id="ieqn-2280"><mml:math id="mml-ieqn-2280"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, but in rate form, was introduced:<xref ref-type="fn" rid="fn276"><sup>276</sup></xref><fn id="fn276"><label>276</label><p>[<xref ref-type="bibr" rid="ref-297">297</xref>], Eq. (2.15).</p></fn></p>
<p><disp-formula id="eqn-426"><label>(426)</label><mml:math id="mml-eqn-426" display="block"><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo>&#x00B7;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mover><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mo>&#x00B7;</mml:mo></mml:mover><mml:mo>+</mml:mo><mml:mrow><mml:mo>&#x301A;</mml:mo><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo>&#x00B7;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x301B;</mml:mo></mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>&#x0393;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mo>&#x00B7;</mml:mo></mml:mover><mml:mo>+</mml:mo><mml:mrow><mml:mo>&#x301A;</mml:mo><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo>&#x00B7;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x301B;</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>&#x0393;</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mo>&#x301A;</mml:mo><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo>&#x00B7;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x301B;</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>&#x0393;</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>&#x0393;</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mover><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x00B7;</mml:mo></mml:mover><mml:mo>+</mml:mo><mml:mrow><mml:mo>&#x301A;</mml:mo><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo>&#x00B7;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x301B;</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>&#x0393;</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>&#x0393;</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;with&#x00A0;</mml:mtext><mml:mover><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x00B7;</mml:mo></mml:mover><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mover><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mo>&#x00B7;</mml:mo></mml:mover><mml:mo>+</mml:mo><mml:mrow><mml:mo>&#x301A;</mml:mo><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo>&#x00B7;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x301B;</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>&#x0393;</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2281"><mml:math id="mml-ieqn-2281"><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula>, defined with the term <inline-formula id="ieqn-2282"><mml:math id="mml-ieqn-2282"><mml:mo>+</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, is not the same as <inline-formula id="ieqn-2283"><mml:math id="mml-ieqn-2283"><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> defined with the term <inline-formula id="ieqn-2284"><mml:math id="mml-ieqn-2284"><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-423">423</xref>)<inline-formula id="ieqn-2285"><mml:math id="mml-ieqn-2285"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> of [<xref ref-type="bibr" rid="ref-25">25</xref>], even though Eq. (<xref ref-type="disp-formula" rid="eqn-423">423</xref>)<inline-formula id="ieqn-2286"><mml:math id="mml-ieqn-2286"><mml:msub><mml:mi></mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> looked similar to Eq. (<xref ref-type="disp-formula" rid="eqn-426">426</xref>)<inline-formula id="ieqn-2287"><mml:math id="mml-ieqn-2287"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>. From Eq. 
(<xref ref-type="disp-formula" rid="eqn-426">426</xref>)<inline-formula id="ieqn-2288"><mml:math id="mml-ieqn-2288"><mml:msub><mml:mi></mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula>, the small-strain rate <inline-formula id="ieqn-2289"><mml:math id="mml-ieqn-2289"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-425">425</xref>)<inline-formula id="ieqn-2290"><mml:math id="mml-ieqn-2290"><mml:msub><mml:mi></mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> in terms of the smoothed velocity <inline-formula id="ieqn-2291"><mml:math id="mml-ieqn-2291"><mml:mover><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x00B7;</mml:mo></mml:mover></mml:math></inline-formula> is<xref ref-type="fn" rid="fn277"><sup>277</sup></xref><fn id="fn277"><label>277</label><p>[<xref ref-type="bibr" rid="ref-297">297</xref>], Eq. (2.17).</p></fn></p>
<p><disp-formula id="eqn-427"><label>(427)</label><mml:math id="mml-eqn-427" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>sym</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mover><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mrow><mml:mtext>sym</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2297;</mml:mo><mml:mi mathvariant="bold-italic">n</mml:mi><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>sym</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2297;</mml:mo><mml:mi 
mathvariant="normal">&#x2207;</mml:mi><mml:msub><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
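<p>A quick numerical check, restricted to one dimension, may help to see why the last term in Eq. (<xref ref-type="disp-formula" rid="eqn-427">427</xref>) is needed: away from the discontinuity, subtracting the jump rate times the ramp gradient from the gradient of the smoothed velocity recovers the gradient of the smooth part of the velocity in Eq. (<xref ref-type="disp-formula" rid="eqn-424">424</xref>). The sketch below is only an illustration under assumed data: a constant jump rate, a cubic smoothstep ramp, a hypothetical quadratic smooth velocity, and hypothetical variable names. The singular term carrying the Dirac delta is not represented on the grid.</p>
<preformat><![CDATA[
```python
import numpy as np


def ramp(x, h):
    """Assumed ramp: 0 for x <= -h, 1 for x >= h, cubic smoothstep in between."""
    s = np.clip((x + h) / (2.0 * h), 0.0, 1.0)
    return s * s * (3.0 - 2.0 * s)


def ramp_gradient(x, h):
    """Analytical derivative of the assumed ramp; zero outside the band."""
    s = np.clip((x + h) / (2.0 * h), 0.0, 1.0)
    return 6.0 * s * (1.0 - s) / (2.0 * h)


h = 0.1             # half-width of the band (illustrative value)
jump_rate = 0.02    # constant jump rate, so its gradient is zero as assumed in Eq. (425)
x = np.linspace(-1.0, 1.0, 4001)

# Hypothetical smooth part of the velocity and its exact gradient.
u_bar_rate = 0.05 * x + 0.01 * x**2
grad_u_bar_rate = 0.05 + 0.02 * x

# Smoothed velocity as defined at the end of Eq. (426).
u_tilde_rate = u_bar_rate + jump_rate * ramp(x, h)
grad_u_tilde_rate = np.gradient(u_tilde_rate, x)

# Regular (non-singular) part of Eq. (427) in 1-D, where sym() is the identity.
regular_strain_rate = grad_u_tilde_rate - jump_rate * ramp_gradient(x, h)

max_error = np.max(np.abs(regular_strain_rate - grad_u_bar_rate))
print("max deviation from the gradient of the smooth part:", max_error)
```
]]></preformat>
<p>The remaining deviation comes from the finite-difference gradient, mainly at the two band edges where the curvature of the assumed ramp is discontinuous, and shrinks as the grid is refined.</p>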
<p>which, with the overhead dots removed, is similar to, but different from, the small-strain expression in [<xref ref-type="bibr" rid="ref-25">25</xref>], written as</p>
<p><disp-formula id="eqn-428"><label>(428)</label><mml:math id="mml-eqn-428" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="bold-italic">&#x03F5;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>sym</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mrow><mml:mtext>sym</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2297;</mml:mo><mml:mi mathvariant="bold-italic">n</mml:mi><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>sym</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2297;</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:msub><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the first term was <inline-formula id="ieqn-2292"><mml:math id="mml-ieqn-2292"><mml:mi mathvariant="bold-italic">u</mml:mi></mml:math></inline-formula> and not <inline-formula id="ieqn-2293"><mml:math id="mml-ieqn-2293"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo>+</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:msub></mml:math></inline-formula> in the notation of [<xref ref-type="bibr" rid="ref-25">25</xref>].<xref ref-type="fn" rid="fn278"><sup>278</sup></xref><fn id="fn278"><label>278</label><p>Recall that <inline-formula id="ieqn-3317"><mml:math id="mml-ieqn-3317"><mml:mi mathvariant='bold-italic'>u</mml:mi></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-25">25</xref>] (Eq. (<xref ref-type="disp-formula" rid="eqn-423">423</xref>)<inline-formula id="ieqn-3318"><mml:math id="mml-ieqn-3318"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>), the &#x201C;large-scale (or conformal) displacement field&#x201D; without displacement jump, is equivalent to <inline-formula id="ieqn-3319"><mml:math id="mml-ieqn-3319"><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-297">297</xref>] (Eq. 
(<xref ref-type="disp-formula" rid="eqn-424">424</xref>)<inline-formula id="ieqn-3320"><mml:math id="mml-ieqn-3320"><mml:msub><mml:mi></mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>), but is of course not the <inline-formula id="ieqn-3321"><mml:math id="mml-ieqn-3321"><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-25">25</xref>] (Eq. (<xref ref-type="disp-formula" rid="eqn-423">423</xref>)<inline-formula id="ieqn-3322"><mml:math id="mml-ieqn-3322"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>).</p></fn></p>
<p>Typically, in this type of formulation [<xref ref-type="bibr" rid="ref-297">297</xref>], once the traction-separation law <inline-formula id="ieqn-2294"><mml:math id="mml-ieqn-2294"><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-422">422</xref>) was available (e.g., Figure <xref ref-type="fig" rid="fig-110">110</xref>), then given the traction <inline-formula id="ieqn-2295"><mml:math id="mml-ieqn-2295"><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:math></inline-formula>, the displacement jump <inline-formula id="ieqn-2296"><mml:math id="mml-ieqn-2296"><mml:mo stretchy="false">[</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> was solved for using Eq. (<xref ref-type="disp-formula" rid="eqn-422">422</xref>) at each Gauss point within a constant-strain triangular (CST) element [<xref ref-type="bibr" rid="ref-25">25</xref>].</p>
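<p>The local solve for the displacement jump at a Gauss point can be sketched as a scalar root-finding problem. The example below is a minimal sketch under strong assumptions: the traction-separation law is a hypothetical, monotonically increasing, saturating function chosen only so that the solve is well posed, and the function names and parameter values are placeholders. It is not the law of Eq. (<xref ref-type="disp-formula" rid="eqn-422">422</xref>) used in [<xref ref-type="bibr" rid="ref-25">25</xref>], nor the solution procedure used there.</p>
<preformat><![CDATA[
```python
import math


def traction(jump, t_max=1.0, delta_c=0.01):
    """Hypothetical scalar traction-separation law T(jump), for illustration only."""
    return t_max * (1.0 - math.exp(-jump / delta_c))


def traction_tangent(jump, t_max=1.0, delta_c=0.01):
    """Derivative dT/d(jump) of the hypothetical law above."""
    return (t_max / delta_c) * math.exp(-jump / delta_c)


def solve_jump(target, t_max=1.0, delta_c=0.01, tol=1e-12, max_iter=50):
    """Newton iteration for the jump such that traction(jump) equals the target."""
    if not 0.0 <= target < t_max:
        raise ValueError("target traction must lie in [0, t_max) for this law")
    jump = 0.0
    for _ in range(max_iter):
        residual = traction(jump, t_max, delta_c) - target
        if abs(residual) < tol:
            return jump
        jump -= residual / traction_tangent(jump, t_max, delta_c)
    raise RuntimeError("Newton iteration did not converge")


jump_at_gauss_point = solve_jump(0.5)
print("jump for T = 0.5:", jump_at_gauss_point)
```
]]></preformat>
<p>In a finite-element setting, a solve of this kind would be repeated at every Gauss point that carries a discontinuity. With a history-dependent law, such as one represented by a recurrent network, the function evaluation would additionally depend on the loading history.</p>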
<p>At this point, the continuum formulation for displacement jumps need not be reviewed further, and we return to the training of the macroscale RNN with LSTM units, which the authors of [<xref ref-type="bibr" rid="ref-25">25</xref>] called the &#x201C;Macroscale data-driven constitutive model&#x201D; (Figure <xref ref-type="fig" rid="fig-14">14</xref>, row 3, right). The training used the data generated from the simulations with the mesoscale RVE (Figure <xref ref-type="fig" rid="fig-113">113</xref>) and the mesoscale RNN with LSTM units obtained earlier, called the &#x201C;Mesoscale data-driven constitutive model&#x201D; (Figure <xref ref-type="fig" rid="fig-14">14</xref>, row 2, right).</p>
<p>The mesoscale RNN with LSTM units (&#x201C;mesoscale data-driven constitutive model&#x201D;) was first validated using the mesoscale RVE with embedded discontinuities (Figure <xref ref-type="fig" rid="fig-113">113</xref>), discretized into finite elements, and subjected to imposed displacements at the top. This combination of FEM and RNN with LSTM units on the mesoscale RVE is denoted FEM-LSTM, with results that compared well with those obtained from the coupled FEM and DEM (denoted FEM-DEM), as shown in Figure <xref ref-type="fig" rid="fig-114">114</xref>.</p>
<p>Once validated, the FEM-LSTM model for the mesoscale RVE was used to generate data to train the macroscale RNN with LSTM units (called &#x201C;Macroscale data-driven constitutive model&#x201D;) by imposing displacement jumps at the top of the mesoscale RVE (Figure <xref ref-type="fig" rid="fig-112">112</xref>), very much like what was done with the microscale RVE (Figure <xref ref-type="fig" rid="fig-106">106</xref>, Figure <xref ref-type="fig" rid="fig-110">110</xref>), just at a larger scale.</p>
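<p>To make the idea of a sequence-to-sequence constitutive surrogate concrete, the following sketch runs a single LSTM layer over a loading-unloading sequence of displacement jumps and evaluates a mean squared error. Everything in it is an assumption for illustration: the layer sizes, the random (untrained) weights, the normalized loading path, and the placeholder target are hypothetical and do not reproduce the architecture, the data, or the results of [<xref ref-type="bibr" rid="ref-25">25</xref>].</p>
<preformat><![CDATA[
```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_forward(inputs, W, U, b):
    """Standard LSTM cell applied step by step; returns hidden states of shape (T, H)."""
    n_hidden = U.shape[1]
    h = np.zeros(n_hidden)   # hidden state
    c = np.zeros(n_hidden)   # cell state (carries the loading history)
    hidden_states = []
    for x_t in inputs:
        z = W @ x_t + U @ h + b
        i_gate, f_gate, o_gate, g_cand = np.split(z, 4)
        i_gate, f_gate, o_gate = sigmoid(i_gate), sigmoid(f_gate), sigmoid(o_gate)
        g_cand = np.tanh(g_cand)
        c = f_gate * c + i_gate * g_cand
        h = o_gate * np.tanh(c)
        hidden_states.append(h)
    return np.array(hidden_states)


def predict_traction(jumps, params):
    """Map a sequence of jumps (T, 2) to a sequence of tractions (T, 2)."""
    hidden = lstm_forward(jumps, params["W"], params["U"], params["b"])
    return hidden @ params["W_out"].T + params["b_out"]


def mse(prediction, target):
    return float(np.mean((prediction - target) ** 2))


rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 8, 2   # hypothetical sizes: (normal, tangential) in and out
params = {
    "W": 0.5 * rng.standard_normal((4 * n_hidden, n_in)),
    "U": 0.5 * rng.standard_normal((4 * n_hidden, n_hidden)),
    "b": np.zeros(4 * n_hidden),
    "W_out": 0.5 * rng.standard_normal((n_out, n_hidden)),
    "b_out": np.zeros(n_out),
}

# Hypothetical normalized loading-unloading path of the normal jump.
s = np.linspace(0.0, 1.0, 21)
path = np.concatenate([s, s[::-1]])
jumps = np.stack([path, np.zeros_like(path)], axis=1)

prediction = predict_traction(jumps, params)
placeholder_target = 0.5 * jumps   # synthetic stand-in, not data from the reviewed work
print("MSE of the untrained network:", mse(prediction, placeholder_target))
```
]]></preformat>
<p>Because the cell state accumulates history, the first and the last step of this path, which share the same zero jump, generally receive different predicted tractions, which is the property that makes recurrent networks candidates for path-dependent traction-separation behavior.</p>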
<p>The accuracy of the macroscale RNN with LSTM units (&#x201C;Macroscale data-driven constitutive model&#x201D;) is illustrated in Figure <xref ref-type="fig" rid="fig-115">115</xref>, where the normal tractions under displacement loading were compared to results obtained with the mesoscale RVE (Figure <xref ref-type="fig" rid="fig-112">112</xref>, Figure <xref ref-type="fig" rid="fig-113">113</xref>), which was used for generating the training data. Once established, the macroscale RNN with LSTM units is used in field-size macroscale simulations. Since there are no further interesting insights into the use of deep learning, we stop our review of [<xref ref-type="bibr" rid="ref-25">25</xref>] here.</p> 
<statement id="st11_10"><title>Remark 11.10.</title>
<p><italic>No non-linear stress-strain relation</italic>. In the end, the authors of [<xref ref-type="bibr" rid="ref-25">25</xref>] only used Figure <xref ref-type="fig" rid="fig-12">12</xref> to motivate the double porosity (in Majella limestone) in their macroscale modeling and simulations, which did not include the characteristic non-linear stress-strain relation found experimentally in Majella limestone as shown in Figure <xref ref-type="fig" rid="fig-13">13</xref>. All nonlinear responses considered in [<xref ref-type="bibr" rid="ref-25">25</xref>] came from the nonlinear traction-separation law obtained from DEM simulations in which the particles themselves were elastic, even though the Hertz contact force-displacement relation was nonlinear [<xref ref-type="bibr" rid="ref-303">303</xref>] [<xref ref-type="bibr" rid="ref-304">304</xref>] [<xref ref-type="bibr" rid="ref-302">302</xref>]. See Remark <xref ref-type="statement" rid="st11_7">11.7</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement> 
<statement id="st11_11"><title>Remark 11.11.</title>
<p><italic>Physics-Informed Neural Networks (PINNs) applied to solid mechanics</italic>. The PINN method discussed in Section <xref ref-type="sec" rid="s9_5">9.5</xref> has been applied to problems in solid mechanics [<xref ref-type="bibr" rid="ref-305">305</xref>]: linear elasticity (square plate, plane strain, trigonometric body force, with exact solution) and nonlinear elasto-plasticity (perforated plate with circular hole, under plane-strain condition and von-Mises elastoplasticity, subjected to uniform extension, showing localized shear band). The method was less accurate for solutions that presented discontinuities (localized high gradients) in the material properties or at the boundary conditions; see Remark <xref ref-type="statement" rid="st7_7">7.7</xref> and Remark <xref ref-type="statement" rid="st9_5">9.5</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement></sec> </sec> </sec>
<sec id="s12"><label>12</label>
<title>Application 3: Fluids, turbulence, reduced-order models</title>
<p>The general ideas behind the work in [<xref ref-type="bibr" rid="ref-26">26</xref>] were presented in Section <xref ref-type="sec" rid="s2_3_3">2.3.3</xref> further above. In this section, we discuss some details of the formulation, starting with a brief primer on Proper Orthogonal Decomposition (POD) for unfamiliar readers.</p>
<sec id="s12_1"><label>12.1</label>
<title>Proper orthogonal decomposition (POD)</title>
<p>The presentation of the <italic>continuous</italic> formulation of POD in this section follows [<xref ref-type="bibr" rid="ref-306">306</xref>]. Consider the separation of variables of a time-dependent function <inline-formula id="ieqn-2297"><mml:math id="mml-ieqn-2297"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which could represent a component of a velocity field or a magnetic potential in a 3-D domain <inline-formula id="ieqn-2298"><mml:math id="mml-ieqn-2298"><mml:mrow><mml:mi>&#x212C;</mml:mi></mml:mrow></mml:math></inline-formula>, where <inline-formula id="ieqn-2299"><mml:math id="mml-ieqn-2299"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-2300"><mml:math id="mml-ieqn-2300"><mml:mi>t</mml:mi></mml:math></inline-formula> a time parameter:</p>
<fig id="fig-116">
<label>Figure 116</label>
<caption><title><italic>2-D datasets for training neural networks</italic> (Sections <xref ref-type="sec" rid="s2_3_3">2.3.3</xref>, <xref ref-type="sec" rid="s12_1">12.1</xref>). Extract 2-D datasets from 3-D turbulent flow field evolving in time. From the 3-D flow field, extract <inline-formula id="ieqn-1980"><mml:math id="mml-ieqn-1980"><mml:mi>N</mml:mi></mml:math></inline-formula> equidistant 2-D planes (slices). Within each 2-D plane, select a region (yellow square), and <inline-formula id="ieqn-1981"><mml:math id="mml-ieqn-1981"><mml:mi>k</mml:mi></mml:math></inline-formula> temporal snapshots of this region as it evolves in time to produce a dataset. Among these <inline-formula id="ieqn-1982"><mml:math id="mml-ieqn-1982"><mml:mi>N</mml:mi></mml:math></inline-formula> datasets, each containing <inline-formula id="ieqn-1983"><mml:math id="mml-ieqn-1983"><mml:mi>k</mml:mi></mml:math></inline-formula> snapshots of the same region within each 2-D plane, the majority of the datasets is used for training, and the rest for testing; see Remark <xref ref-type="statement" rid="st12_1">12.1</xref>. For each dataset, the reduced POD basis consists of <inline-formula id="ieqn-1984"><mml:math id="mml-ieqn-1984"><mml:mi>m</mml:mi><mml:mo>&#x226A;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> POD modes with highest eigenvalues of a matrix constructed from the <inline-formula id="ieqn-1985"><mml:math id="mml-ieqn-1985"><mml:mi>k</mml:mi></mml:math></inline-formula> snapshots (Figure <xref ref-type="fig" rid="fig-18">18</xref>) [<xref ref-type="bibr" rid="ref-26">26</xref>]. (Figure reproduced with permission of the author.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-116.tif"/>
</fig>
<p><disp-formula id="eqn-429"><label>(429)</label><mml:math id="mml-eqn-429" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2301"><mml:math id="mml-ieqn-2301"><mml:mo>&#x03B1;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the time-dependent amplitude, and <inline-formula id="ieqn-2302"><mml:math id="mml-ieqn-2302"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> a function of <inline-formula id="ieqn-2303"><mml:math id="mml-ieqn-2303"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:math></inline-formula> representing the most typical spatial structure of <inline-formula id="ieqn-2304"><mml:math id="mml-ieqn-2304"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The goal is to find the function <inline-formula id="ieqn-2305"><mml:math id="mml-ieqn-2305"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> that maximizes the square of the amplitude <inline-formula id="ieqn-2306"><mml:math id="mml-ieqn-2306"><mml:mo>&#x03B1;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (or component of <inline-formula id="ieqn-2307"><mml:math id="mml-ieqn-2307"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> along <inline-formula id="ieqn-2308"><mml:math id="mml-ieqn-2308"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>):</p>
<p><disp-formula id="eqn-430"><label>(430)</label><mml:math id="mml-eqn-430" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>:=</mml:mo><mml:msub><mml:mo mathvariant="italic">&#x222B;</mml:mo><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow></mml:msub><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-431"><label>(431)</label><mml:math id="mml-eqn-431" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>i.e., if <inline-formula id="ieqn-2309"><mml:math id="mml-ieqn-2309"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is normalized to 1, then the amplitude <inline-formula id="ieqn-2310"><mml:math id="mml-ieqn-2310"><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the spatial scalar product between <inline-formula id="ieqn-2311"><mml:math id="mml-ieqn-2311"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-2312"><mml:math id="mml-ieqn-2312"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-2313"><mml:math id="mml-ieqn-2313"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> designating the spatial scalar product operator. Let <inline-formula id="ieqn-2314"><mml:math id="mml-ieqn-2314"><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo></mml:math></inline-formula> designate the time average operator, i.e.,</p>
<p><disp-formula id="eqn-432"><label>(432)</label><mml:math id="mml-eqn-432" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>T</mml:mi></mml:mfrac><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2315"><mml:math id="mml-ieqn-2315"><mml:mi>T</mml:mi></mml:math></inline-formula> designates the maximum time. The goal now is to find <inline-formula id="ieqn-2316"><mml:math id="mml-ieqn-2316"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> such that the time average of the square of the amplitude <inline-formula id="ieqn-2317"><mml:math id="mml-ieqn-2317"><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is maximized:</p>
<p><disp-formula id="eqn-433"><label>(433)</label><mml:math id="mml-eqn-433" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>argmax</mml:mtext></mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:munder><mml:mo stretchy="false">(</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msup><mml:mi>&#x03B1;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>argmax</mml:mtext></mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:munder><mml:mo stretchy="false">(</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace"/></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-117">
<label>Figure 117</label>
<caption><title><italic>LSTM unit and BiLSTM unit</italic> (Sections <xref ref-type="sec" rid="s2_3_2">2.3.2</xref>, <xref ref-type="sec" rid="s2_3_3">2.3.3</xref>, <xref ref-type="sec" rid="s7_2">7.2</xref>, <xref ref-type="sec" rid="s12_2">12.2</xref>). Each blue dot is an original LSTM unit (in <italic>folded</italic> form Figure <xref ref-type="fig" rid="fig-81">81</xref> in Section <xref ref-type="sec" rid="s7_2">7.2</xref>, without peepholes as shown in Figure <xref ref-type="fig" rid="fig-15">15</xref>), thus a single hidden layer. The above LSTM architecture (left) in <italic>unfolded</italic> form corresponds to Figure <xref ref-type="fig" rid="fig-82">82</xref>, with the inputs at state <inline-formula id="ieqn-1986"><mml:math id="mml-ieqn-1986"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> designated by <inline-formula id="ieqn-1987"><mml:math id="mml-ieqn-1987"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the corresponding outputs by <inline-formula id="ieqn-1988"><mml:math id="mml-ieqn-1988"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, for <inline-formula id="ieqn-1989"><mml:math id="mml-ieqn-1989"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo></mml:math></inline-formula>. 
In the BiLSTM architecture (right), there are two LSTM units in the hidden layer, with the forward flow of information in the bottom LSTM unit, and the backward flow in the top LSTM unit [<xref ref-type="bibr" rid="ref-26">26</xref>]. (Figure reproduced with permission of the author.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-117.tif"/>
</fig>
<p>which is equivalent to maximizing the amplitude <inline-formula id="ieqn-2318"><mml:math id="mml-ieqn-2318"><mml:mo>&#x03B1;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and thus the information content of <inline-formula id="ieqn-2319"><mml:math id="mml-ieqn-2319"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in <inline-formula id="ieqn-2320"><mml:math id="mml-ieqn-2320"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which in turn is also called &#x201C;coherent structure&#x201D;. The square of the amplitude in Eq. (<xref ref-type="disp-formula" rid="eqn-433">433</xref>) can be written as</p>
<p><disp-formula id="eqn-434"><label>(434)</label><mml:math id="mml-eqn-434" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03BB;</mml:mi><mml:mo>:=</mml:mo><mml:msup><mml:mi>&#x03B1;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mrow><mml:mo>[</mml:mo><mml:munder><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>so that &#x03BB; is the component (or projection) of the term in square brackets in Eq. (<xref ref-type="disp-formula" rid="eqn-434">434</xref>) along the direction <inline-formula id="ieqn-2322"><mml:math id="mml-ieqn-2322"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The component &#x03BB; is maximal if this term in square brackets is colinear with (or &#x201C;parallel to&#x201D;) <inline-formula id="ieqn-2324"><mml:math id="mml-ieqn-2324"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e.,</p>
<p><disp-formula id="eqn-435"><label>(435)</label><mml:math id="mml-eqn-435" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:munder><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:munder><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is a <italic>continuous</italic> eigenvalue problem with the eigenpair being <inline-formula id="ieqn-2325"><mml:math id="mml-ieqn-2325"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. In practice, the dynamic quantity <inline-formula id="ieqn-2326"><mml:math id="mml-ieqn-2326"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is sampled at <inline-formula id="ieqn-2327"><mml:math id="mml-ieqn-2327"><mml:mi>k</mml:mi></mml:math></inline-formula> discrete times <inline-formula id="ieqn-2328"><mml:math id="mml-ieqn-2328"><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>.</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> to produce <inline-formula id="ieqn-2329"><mml:math id="mml-ieqn-2329"><mml:mi>k</mml:mi></mml:math></inline-formula> snapshots, which are functions of <inline-formula id="ieqn-2330"><mml:math id="mml-ieqn-2330"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:math></inline-formula>, assumed to be linearly independent, and ordered in matrix form as follows:</p>
<p><disp-formula id="eqn-436"><label>(436)</label><mml:math id="mml-eqn-436" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>=:</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>=:</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
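<p>To make Eq. (<xref ref-type="disp-formula" rid="eqn-430">430</xref>) to Eq. (<xref ref-type="disp-formula" rid="eqn-436">436</xref>) concrete, the following minimal Python sketch evaluates the time-averaged squared amplitude of a trial structure <italic>phi</italic> on sampled snapshots. It is an illustration under stated assumptions, not the implementation of the cited works: the domain is taken as a 1-D interval, the spatial scalar product is approximated by a Riemann sum on a uniform grid, the time average in Eq. (<xref ref-type="disp-formula" rid="eqn-432">432</xref>) is approximated by the arithmetic mean over the <italic>k</italic> snapshots, and the travelling Gaussian pulse as well as the names <italic>inner</italic> and <italic>mean_square_amplitude</italic> are hypothetical.</p>
<code language="python">
```python
import numpy as np


def inner(f, g, dx):
    """Spatial scalar product (f, g), approximated by a Riemann sum."""
    return float(np.sum(f * g) * dx)


def mean_square_amplitude(U, phi, dx):
    """Discrete time average of alpha^2 for a trial structure phi."""
    phi = phi / np.sqrt(inner(phi, phi, dx))  # enforce (phi, phi) = 1
    alpha = np.array([inner(u_j, phi, dx) for u_j in U])  # alpha(t_j) = (u_j, phi)
    return float(np.mean(alpha**2))  # arithmetic mean over the k snapshots


# Assumed 1-D domain [0, 1] on a uniform grid (illustrative choice).
x = np.linspace(0.0, 1.0, 201)
dx = x[1] - x[0]

# Hypothetical field: a Gaussian pulse travelling to the right, sampled at k = 5 times.
times = np.linspace(0.0, 1.0, 5)
U = np.array([np.exp(-((x - 0.3 - 0.4 * t_j) ** 2) / (2 * 0.05**2)) for t_j in times])

# Two trial structures: a broad bump and a rapidly oscillating cosine.
phi_broad = np.exp(-((x - 0.5) ** 2) / (2 * 0.2**2))
phi_wavy = np.cos(20 * np.pi * x)

msa_broad = mean_square_amplitude(U, phi_broad, dx)
msa_wavy = mean_square_amplitude(U, phi_wavy, dx)
print(msa_broad, msa_wavy)
```
</code>
<p>In this assumed setup, the broad bump overlaps with every snapshot of the pulse, whereas the oscillating cosine nearly averages out against it, so the first trial structure yields the larger time-averaged squared amplitude. The POD seeks the structure for which this quantity is largest.</p>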
<p>The coherent structure <inline-formula id="ieqn-2331"><mml:math id="mml-ieqn-2331"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> can be expressed on the basis of the snapshots as</p>
<p><disp-formula id="eqn-437"><label>(437)</label><mml:math id="mml-eqn-437" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">&#x03B2;</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="2"><mml:mrow><mml:mo>&#x2219;</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi mathvariant="bold-italic">&#x03B2;</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
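<p>A minimal Python sketch of Eq. (<xref ref-type="disp-formula" rid="eqn-437">437</xref>) follows, assuming a 1-D domain, a Riemann-sum scalar product, the arithmetic mean over snapshots as time average, and a hypothetical travelling-pulse field; all of these are illustrative choices and are not taken from the cited works. The coefficient vector &#x03B2; is taken here as the leading eigenvector of the <italic>k</italic>-by-<italic>k</italic> snapshot correlation matrix, i.e., of the discretized eigenvalue problem that the text states next. The last lines compare the time-averaged squared amplitude of the resulting normalized structure with the leading eigenvalue.</p>
<code language="python">
```python
import numpy as np

# Assumed 1-D domain [0, 1] on a uniform grid (illustrative choice).
x = np.linspace(0.0, 1.0, 201)
dx = x[1] - x[0]

# Hypothetical field: a travelling Gaussian pulse sampled at k = 5 times.
# Pulses with distinct centers are linearly independent, as the text assumes.
times = np.linspace(0.0, 1.0, 5)
U = np.array([np.exp(-((x - 0.3 - 0.4 * t_j) ** 2) / (2 * 0.05**2)) for t_j in times])
k = U.shape[0]

# Snapshot correlation matrix C_ij = (1/k) (u_i, u_j), Riemann-sum scalar product.
C = (U @ U.T) * dx / k

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so both outputs are reversed to put the largest eigenvalue first.
lam, B = np.linalg.eigh(C)
lam, B = lam[::-1], B[:, ::-1]

# Coherent structure as a combination of snapshots: phi(x) = sum_i beta_i u_i(x).
beta = B[:, 0]
phi = beta @ U
phi = phi / np.sqrt(np.sum(phi * phi) * dx)  # normalize so that (phi, phi) = 1

# Amplitudes alpha(t_j) = (u_j, phi) and their discrete time-averaged square.
alpha = (U @ phi) * dx
mean_alpha_sq = float(np.mean(alpha**2))
print(mean_alpha_sq, lam[0])
```
</code>
<p>Under these assumptions, the two printed numbers should agree to round-off: with &#x03D5; normalized, the amplitudes are proportional to the entries of &#x03B2;, and the time average of their squares reduces to the leading eigenvalue. Sorting the eigenpairs in descending order also makes it straightforward to keep only the first few modes when a reduced basis is wanted.</p>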
<p>As a result of the discrete nature of Eq. (<xref ref-type="disp-formula" rid="eqn-436">436</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-437">437</xref>), the eigenvalue problem in Eq. (<xref ref-type="disp-formula" rid="eqn-435">435</xref>) is discretized into</p>
<p><disp-formula id="eqn-438"><label>(438)</label><mml:math id="mml-eqn-438" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">C</mml:mi><mml:mi mathvariant="bold-italic">&#x03B2;</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mi mathvariant="bold-italic">&#x03B2;</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi><mml:mo>:=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mfrac><mml:munder><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>]</mml:mo><mml:mo>,</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the matrix <inline-formula id="ieqn-2332"><mml:math id="mml-ieqn-2332"><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:math></inline-formula> is symmetric, positive definite, leading to positive eigenvalues in <inline-formula id="ieqn-2333"><mml:math id="mml-ieqn-2333"><mml:mi>k</mml:mi></mml:math></inline-formula> eigenpairs <inline-formula id="ieqn-2334"><mml:math id="mml-ieqn-2334"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B2;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-2335"><mml:math id="mml-ieqn-2335"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>. With <inline-formula id="ieqn-2336"><mml:math id="mml-ieqn-2336"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> now decomposed into <inline-formula id="ieqn-2337"><mml:math id="mml-ieqn-2337"><mml:mi>k</mml:mi></mml:math></inline-formula> linearly independent directions <inline-formula id="ieqn-2338"><mml:math id="mml-ieqn-2338"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>:=</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> according to Eq. 
(<xref ref-type="disp-formula" rid="eqn-437">437</xref>), the dynamic quantity <inline-formula id="ieqn-2339"><mml:math id="mml-ieqn-2339"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-429">429</xref>) can now be written as a linear combination of <inline-formula id="ieqn-2340"><mml:math id="mml-ieqn-2340"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-2341"><mml:math id="mml-ieqn-2341"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>, each with a different time-dependent amplitude <inline-formula id="ieqn-2342"><mml:math id="mml-ieqn-2342"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e.,</p>
<p><disp-formula id="eqn-439"><label>(439)</label><mml:math id="mml-eqn-439" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which is called a proper orthogonal decomposition of <inline-formula id="ieqn-2343"><mml:math id="mml-ieqn-2343"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and is recorded in Figure <xref ref-type="fig" rid="fig-18">18</xref> as &#x201C;Full POD reconstruction&#x201D;. Technically, Eq. (<xref ref-type="disp-formula" rid="eqn-437">437</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-439">439</xref>) are approximations of infinite-dimensional functions by a finite number of linearly-independent functions.</p>
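<p>As a rough illustration of Eq. (<xref ref-type="disp-formula" rid="eqn-438">438</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-439">439</xref>), the following minimal Python sketch computes POD modes from sampled snapshots. It is not the implementation of [<xref ref-type="bibr" rid="ref-26">26</xref>]. The following are assumptions made here for illustration: the snapshots are stored column-wise in a NumPy array <monospace>U</monospace>, the spatial integral is approximated with a uniform quadrature weight <monospace>dy</monospace>, the snapshots are linearly independent, the modes are normalized to unit discrete norm, and the amplitudes are obtained by projection. The function name <monospace>pod_snapshots</monospace> is hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def pod_snapshots(U, dy=1.0):
    """POD by the eigenproblem of the snapshot correlation matrix.

    U  : array (n_points, k), column j holds snapshot u(x, t_j).
    dy : uniform quadrature weight (assumed) approximating the spatial integral.
    """
    k = U.shape[1]
    # Discrete analogue of C in Eq. (438): C[i, j] = (1/k) * sum_x u_i(x) u_j(x) dy
    C = (U.T @ U) * dy / k
    lam, beta = np.linalg.eigh(C)      # C is symmetric; eigenvalues come out ascending
    order = np.argsort(lam)[::-1]      # reorder so the most energetic mode is first
    lam, beta = lam[order], beta[:, order]
    # Each mode is a linear combination of the snapshots, weighted by an eigenvector
    Phi = U @ beta
    # Normalize each mode to unit discrete L2 norm (requires all eigenvalues positive)
    Phi = Phi / np.sqrt(np.sum(Phi**2, axis=0) * dy)
    # Time-dependent amplitudes by projection: alpha[i, j] approximates alpha_i(t_j)
    alpha = (Phi.T @ U) * dy
    return lam, Phi, alpha

# Synthetic, linearly independent snapshots (random data, for illustration only)
rng = np.random.default_rng(0)
n_points, k = 200, 12
dy = 1.0 / n_points
U = rng.standard_normal((n_points, k))

lam, Phi, alpha = pod_snapshots(U, dy)
U_full = Phi @ alpha                   # full POD reconstruction, Eq. (439)
print("max reconstruction error:", np.max(np.abs(U_full - U)))
```
]]></preformat>
<p>With all <monospace>k</monospace> modes retained, the product <monospace>Phi @ alpha</monospace> reproduces the sampled snapshots up to round-off, which is the discrete counterpart of the full POD reconstruction.</p>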
<fig id="fig-118">
<label>Figure 118</label>
<caption><title><italic>LSTM/BiLSTM training strategy</italic> (Sections <xref ref-type="sec" rid="s12_2_1">12.2.1</xref>, <xref ref-type="sec" rid="s12_2_2">12.2.2</xref>). From the 1-D time series <inline-formula id="ieqn-1990"><mml:math id="mml-ieqn-1990"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of each dominant mode <inline-formula id="ieqn-1991"><mml:math id="mml-ieqn-1991"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, for <inline-formula id="ieqn-1992"><mml:math id="mml-ieqn-1992"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, use a moving window to extract thousands of samples <inline-formula id="ieqn-1993"><mml:math id="mml-ieqn-1993"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-1994"><mml:math id="mml-ieqn-1994"><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-1995"><mml:math id="mml-ieqn-1995"><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> being the time of snapshot <inline-formula id="ieqn-1996"><mml:math id="mml-ieqn-1996"><mml:mi>k</mml:mi></mml:math></inline-formula>. 
Each sample is subdivided into an input signal <inline-formula id="ieqn-1997"><mml:math id="mml-ieqn-1997"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-1998"><mml:math id="mml-ieqn-1998"><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> and an output signal <inline-formula id="ieqn-1999"><mml:math id="mml-ieqn-1999"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-2000"><mml:math id="mml-ieqn-2000"><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-2001"><mml:math 
id="mml-ieqn-2001"><mml:msubsup><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-2002"><mml:math id="mml-ieqn-2002"><mml:mn>0</mml:mn><mml:mo>&lt;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The windows can be overlapping. These thousands of input/output pairs were then used to train LSTM-ROM networks, in which LSTM can be replaced by BiLSTM (Figure <xref ref-type="fig" rid="fig-117">117</xref>). 
The trained LSTM/BiLSTM-ROM networks were then used to predict <inline-formula id="ieqn-2003"><mml:math id="mml-ieqn-2003"><mml:msub><mml:mover accent='true'><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-2004"><mml:math id="mml-ieqn-2004"><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> of the test datasets, given <inline-formula id="ieqn-2005"><mml:math id="mml-ieqn-2005"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-2006"><mml:math id="mml-ieqn-2006"><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-26">26</xref>]. (Adapted with permission of the author.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-118.tif"/>
</fig>
<p>Usually, a subset of <inline-formula id="ieqn-2344"><mml:math id="mml-ieqn-2344"><mml:mi>m</mml:mi><mml:mo>&#x226A;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> POD modes is selected such that the error committed by truncating the basis, as done in Eq. (<xref ref-type="disp-formula" rid="eqn-4">4</xref>), is small compared to the full decomposition in Eq. (<xref ref-type="disp-formula" rid="eqn-439">439</xref>); Eq. (<xref ref-type="disp-formula" rid="eqn-4">4</xref>) is recalled here for convenience:</p>
<p><disp-formula id="eqn-4a"><label>(4)</label><mml:math id="mml-eqn-4a" display="block"><mml:mrow><mml:mi>u</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2248;</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:msub><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mtext>with&#x00A0;</mml:mtext><mml:mi>m</mml:mi><mml:mo>&#x226A;</mml:mo><mml:mi>k</mml:mi><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext><mml:msub><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo>&#x007B;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x007D;</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>One way is to select the POD modes corresponding to the highest eigenvalues (or energies) in Eq. (<xref ref-type="disp-formula" rid="eqn-435">435</xref>); see Step (<xref ref-type="list" rid="L2">2</xref>) in Section <xref ref-type="sec" rid="s12_2_2">12.2.2</xref>.</p> 
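<p>The selection by highest eigenvalues can be sketched as follows. This is a minimal illustration under assumptions made here: the eigenvalues are given as a plain array, the ratio of retained to total eigenvalue sum is used as an indicative measure of captured energy, and the function name <monospace>select_dominant_modes</monospace> and the sample values are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def select_dominant_modes(eigenvalues, m):
    """Return the indices of the m largest eigenvalues and the energy fraction they capture."""
    lam = np.asarray(eigenvalues, dtype=float)
    # Indices i_1, ..., i_m of Eq. (4), ordered from most to least energetic
    idx = np.argsort(lam)[::-1][:m]
    # Fraction of the total eigenvalue sum retained by the truncated basis
    energy_fraction = lam[idx].sum() / lam.sum()
    return idx, energy_fraction

# Hypothetical eigenvalues of five POD modes, deliberately unsorted
lam = np.array([0.1, 5.0, 0.4, 2.5, 1.0])
idx, frac = select_dominant_modes(lam, m=2)
print(idx, frac)
```
]]></preformat>
<p>Returning the indices rather than the sorted eigenvalues keeps the link to the original mode numbering, in the spirit of the index set in Eq. (<xref ref-type="disp-formula" rid="eqn-4">4</xref>).</p>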
<statement id="st12_1"><title>Remark 12.1.</title>
<p><italic>Reduced-order POD</italic>. Data for two physical problems were available from numerical simulations: (1) the Forced Isotropic Turbulence (ISO) dataset, and (2) the Magnetohydrodynamic Turbulence (MHD) dataset [<xref ref-type="bibr" rid="ref-105">105</xref>]. For each physical problem, the authors of [<xref ref-type="bibr" rid="ref-26">26</xref>] employed <inline-formula id="ieqn-2345"><mml:math id="mml-ieqn-2345"><mml:mi>N</mml:mi><mml:mo>=</mml:mo><mml:mn>6</mml:mn></mml:math></inline-formula> equidistant 2-D planes (slices, Figure <xref ref-type="fig" rid="fig-116">116</xref>), with 5 of those 2-D planes used for training, and the 1 remaining 2-D plane used for testing (see Section <xref ref-type="sec" rid="s6_1">6.1</xref>). The same sub-region of the 6 equidistant 2-D planes (yellow squares in Figure <xref ref-type="fig" rid="fig-116">116</xref>) was used to generate 6 training / testing datasets. For each training / testing dataset, <inline-formula id="ieqn-2346"><mml:math id="mml-ieqn-2346"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>5</mml:mn><mml:mo>,</mml:mo><mml:mn>023</mml:mn></mml:math></inline-formula> snapshots for the ISO dataset and <inline-formula id="ieqn-2347"><mml:math id="mml-ieqn-2347"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>024</mml:mn></mml:math></inline-formula> snapshots for the MHD dataset were used in [<xref ref-type="bibr" rid="ref-26">26</xref>], because the ISO dataset contained 5,023 time steps, whereas the MHD dataset contained 1,024 time steps; the number of snapshots was thus the same as the number of time steps. 
These snapshots were reduced to <inline-formula id="ieqn-2348"><mml:math id="mml-ieqn-2348"><mml:mi>m</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>5</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>10</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> POD modes with highest eigenvalues (thus energies), which were much fewer than the original <inline-formula id="ieqn-2349"><mml:math id="mml-ieqn-2349"><mml:mi>k</mml:mi></mml:math></inline-formula> snapshots, since <inline-formula id="ieqn-2350"><mml:math id="mml-ieqn-2350"><mml:mi>m</mml:mi><mml:mo>&#x226A;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>. See Remark <xref ref-type="statement" rid="st12_3">12.3</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<fig id="fig-119">
<label>Figure 119</label>
<caption><title><italic>Two methods of developing LSTM-ROM</italic> (Sections <xref ref-type="sec" rid="s2_3_3">2.3.3</xref>, <xref ref-type="sec" rid="s12_2_2">12.2.2</xref>). For each physical problem (ISO or MHD): (a) Each dominant POD mode has a network to predict <inline-formula id="ieqn-368"><mml:math id="mml-ieqn-368"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-369"><mml:math id="mml-ieqn-369"><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, given <inline-formula id="ieqn-370"><mml:math id="mml-ieqn-370"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>; (b) All <inline-formula id="ieqn-371"><mml:math id="mml-ieqn-371"><mml:mi>m</mml:mi></mml:math></inline-formula> dominant POD modes share the same network to predict <inline-formula id="ieqn-372"><mml:math id="mml-ieqn-372"><mml:mrow><mml:mo>&#x007B;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x007D;</mml:mo></mml:mrow></mml:math></inline-formula>, with <inline-formula id="ieqn-373a"><mml:math id="mml-ieqn-373a"><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, given <inline-formula id="ieqn-373"><mml:math 
id="mml-ieqn-373"><mml:mrow><mml:mo>&#x007B;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>&#x2026;</mml:mn><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x007D;</mml:mo></mml:mrow></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-26">26</xref>]. (Figure reproduced with permission of the author.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-119.tif"/>
</fig>
<statement id="st12_2"><title>Remark 12.2.</title>
<p><italic>Another method of finding the POD modes</italic> without forming the symmetric matrix <inline-formula id="ieqn-2351"><mml:math id="mml-ieqn-2351"><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-438">438</xref>) is by using the Singular Value Decomposition (SVD) directly on the rectangular matrix of the sampled snapshots, discrete in both space and time. The POD modes are then obtained from the left singular vectors times the corresponding singular values. A reduced POD basis is obtained next based on an information-content matrix. See [<xref ref-type="bibr" rid="ref-306">306</xref>], where POD was applied to efficiently solve nonlinear electromagnetic problems governed by Maxwell&#x2019;s equations with nonlinear hysteresis at low frequency (10 kHz), called static hysteresis, discretized by a finite-element method. See also Remark <xref ref-type="statement" rid="st12_4">12.4</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
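<p>The SVD route described in this remark can be sketched as below. The sketch assumes, for illustration only, that the sampled snapshots are stored column-wise in a NumPy array <monospace>U</monospace> and that the reduced basis is chosen simply by keeping the first <monospace>m</monospace> singular triplets; the information-content criterion of [<xref ref-type="bibr" rid="ref-306">306</xref>] is not reproduced here, and the function name <monospace>pod_svd</monospace> is hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def pod_svd(U, m):
    """POD modes from the thin SVD of the snapshot matrix U (n_points, k)."""
    W, s, Vt = np.linalg.svd(U, full_matrices=False)   # U = W @ diag(s) @ Vt
    modes = W[:, :m] * s[:m]    # left singular vectors times singular values
    coeffs = Vt[:m, :]          # row i holds the time history paired with mode i
    return modes, coeffs, s

# Synthetic snapshots (random data, for illustration only)
rng = np.random.default_rng(1)
n_points, k = 50, 8
U = rng.standard_normal((n_points, k))

modes, coeffs, s = pod_svd(U, m=k)        # all modes: exact reconstruction
modes3, coeffs3, _ = pod_svd(U, m=3)      # truncated basis
err3 = np.linalg.norm(U - modes3 @ coeffs3) ** 2
print("squared truncation error:", err3, "discarded s^2:", np.sum(s[3:] ** 2))
```
]]></preformat>
<p>The squared singular values divided by the number of snapshots coincide with the eigenvalues of the matrix <monospace>U.T @ U / k</monospace>, so the matrix <inline-formula><mml:math><mml:mi mathvariant="bold-italic">C</mml:mi></mml:math></inline-formula> never has to be formed explicitly.</p>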
</sec>
<sec id="s12_2"><label>12.2</label>
<title>POD with LSTM-Reduced-Order-Model</title>
<p>Typically, once the dominant POD modes of a physical problem (ISO or MHD) are identified, a reduced-order model (ROM) can be obtained by projecting the governing partial differential equations (PDEs) onto the basis of the dominant POD modes using, e.g., Galerkin projection (GP). Using this method, the authors of [<xref ref-type="bibr" rid="ref-306">306</xref>] employed full-order simulations of the governing electro-magnetic PDE with certain input excitation to generate POD modes, which were then used to project similar PDEs with different parameters and to solve for the coefficients <inline-formula id="ieqn-2352"><mml:math id="mml-ieqn-2352"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> under different input excitations.</p>
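<p>To make the idea of Galerkin projection concrete, the sketch below applies it to a deliberately simple stand-in problem, not to the PDEs of [<xref ref-type="bibr" rid="ref-306">306</xref>] or [<xref ref-type="bibr" rid="ref-26">26</xref>]. The assumptions made here are: a linear semi-discrete system with a constant matrix <monospace>A</monospace>, a basis <monospace>Phi</monospace> with orthonormal columns, and explicit Euler time stepping. The names <monospace>galerkin_rom</monospace> and <monospace>integrate</monospace> are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def galerkin_rom(A, Phi):
    """Reduced operator for du/dt = A u with u approximated by Phi @ alpha.

    Requiring the residual to be orthogonal to the columns of Phi gives
    d(alpha)/dt = (Phi^T A Phi) alpha when Phi has orthonormal columns.
    """
    return Phi.T @ A @ Phi

def integrate(A, u0, dt, n_steps):
    """Explicit Euler time stepping of du/dt = A u (chosen only for brevity)."""
    u = np.array(u0, dtype=float)
    for _ in range(n_steps):
        u = u + dt * (A @ u)
    return u

# Toy full-order model: four decoupled decaying components
A = np.diag([-1.0, -2.0, -30.0, -40.0])
Phi = np.eye(4)[:, :2]                  # basis spanning the two slow components
u0 = np.array([1.0, 0.5, 0.0, 0.0])     # initial state lying in the span of Phi

A_r = galerkin_rom(A, Phi)              # 2 x 2 reduced operator
u_full = integrate(A, u0, dt=0.01, n_steps=100)
u_rom = Phi @ integrate(A_r, Phi.T @ u0, dt=0.01, n_steps=100)
print(u_full, u_rom)
```
]]></preformat>
<p>In this toy case the basis spans an invariant subspace containing the initial state, so the reduced and full solutions agree; for a truncated POD basis of a nonlinear problem, the agreement would only be approximate.</p>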
<sec id="s12_2_1"><label>12.2.1</label>
<title>Goal for using neural network</title>
<p>Instead of using GP on the dominant POD modes of a physical problem to solve for the coefficients <inline-formula id="ieqn-2353"><mml:math id="mml-ieqn-2353"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as described above, a deep-learning neural network was used in [<xref ref-type="bibr" rid="ref-26">26</xref>] to predict the next value <inline-formula id="ieqn-2354"><mml:math id="mml-ieqn-2354"><mml:mrow><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula>, with <inline-formula id="ieqn-2355"><mml:math id="mml-ieqn-2355"><mml:mrow><mml:msup><mml:mi>t</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:math></inline-formula>, given the current value <inline-formula id="ieqn-2356"><mml:math id="mml-ieqn-2356"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, for <inline-formula id="ieqn-2357"><mml:math id="mml-ieqn-2357"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>.</p>
<p>To achieve this goal, LSTM/BiLSTM networks (Figure <xref ref-type="fig" rid="fig-117">117</xref>) were trained using thousands of paired short input / output signals obtained by segmenting the time-dependent signal <inline-formula id="ieqn-2358"><mml:math id="mml-ieqn-2358"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of the dominant POD mode <inline-formula id="ieqn-2359"><mml:math id="mml-ieqn-2359"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-26">26</xref>].</p>
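<p>The segmentation of a coefficient time series into paired input / output signals can be sketched as follows. This is only an illustration under assumptions made here, not the code of [<xref ref-type="bibr" rid="ref-26">26</xref>]: the signal is sampled at the snapshot times, window lengths are counted in samples (<monospace>n_inp</monospace>, <monospace>n_out</monospace>), the input and output parts use disjoint index ranges, and overlapping windows are controlled by a <monospace>stride</monospace> parameter. All names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def make_windows(alpha, n_inp, n_out, stride=1):
    """Cut a 1-D series into (input, output) pairs with a moving window.

    Each sample has n_inp + n_out consecutive values: the first n_inp values
    form the input signal, the remaining n_out values form the output signal.
    """
    alpha = np.asarray(alpha)
    if not (0 < n_out <= n_inp):
        raise ValueError("expected 0 < n_out <= n_inp")
    n_spl = n_inp + n_out
    if len(alpha) < n_spl:
        raise ValueError("series is shorter than one sample")
    # Window start indices; stride < n_spl yields overlapping windows
    starts = range(0, len(alpha) - n_spl + 1, stride)
    X = np.stack([alpha[s:s + n_inp] for s in starts])
    Y = np.stack([alpha[s + n_inp:s + n_spl] for s in starts])
    return X, Y

# Hypothetical coefficient series with 10 samples
alpha = np.arange(10.0)
X, Y = make_windows(alpha, n_inp=4, n_out=2, stride=1)
print(X.shape, Y.shape)
print(X[0], Y[0])
```
]]></preformat>
<p>With a stride of one sample, a series of length <monospace>n</monospace> yields <monospace>n - (n_inp + n_out) + 1</monospace> samples, which is how a single series of a few thousand snapshots can supply thousands of training pairs.</p>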
<fig id="fig-120">
<label>Figure 120</label>
<caption><title><italic>Hurst exponent vs POD-mode rank for Isotropic Turbulence (ISO)</italic> (Section <xref ref-type="sec" rid="s12_3">12.3</xref>). POD modes with larger eigenvalues (Eq. (<xref ref-type="disp-formula" rid="eqn-438">438</xref>)) are higher ranked, and have lower rank numbers, e.g., POD mode rank 7 has a larger eigenvalue, and is thus more dominant, than POD mode rank 50. The Hurst exponent, even though fluctuating, trends downward with the POD mode rank number, but not monotonically, i.e., for two POD modes sufficiently far apart (e.g., mode 7 and mode 50), the lower-ranked POD mode (the one with the larger rank number) generally has a lower Hurst exponent. The first 800 POD modes for the ISO problem have Hurst exponents higher than 0.5, and are thus persistent [<xref ref-type="bibr" rid="ref-26">26</xref>]. (Adapted with permission of the author.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-120.tif"/>
</fig>
</sec>
<sec id="s12_2_2"><label>12.2.2</label>
<title>Data generation, training and testing procedure</title>
<p>The following procedure was adopted in [<xref ref-type="bibr" rid="ref-26">26</xref>] to develop their LSTM-ROM for two physical problems, ISO and MHD; see Remark <xref ref-type="statement" rid="st12_1">12.1</xref>. For each of the two physical problems (ISO and MHD), the following steps were used:</p>
<list id="L2" list-type="simple">
<list-item><p>(1) From the 3-D computational domain of a physical problem (ISO or MHD), select <inline-formula id="ieqn-2360"><mml:math id="mml-ieqn-2360"><mml:mi>N</mml:mi></mml:math></inline-formula> equidistant 2-D planes that slice through this 3-D domain, and select the same subregion for all of these planes, the majority of which are used for the training datasets, and the rest for the test datasets (see Figure <xref ref-type="fig" rid="fig-116">116</xref>, and Remark <xref ref-type="statement" rid="st12_1">12.1</xref> for the actual value of <inline-formula id="ieqn-2361"><mml:math id="mml-ieqn-2361"><mml:mi>N</mml:mi></mml:math></inline-formula> and the numbers of training datasets and test datasets employed in [<xref ref-type="bibr" rid="ref-26">26</xref>]).</p></list-item>
<list-item><p>(2) For each of the training datasets and test datasets, extract from <inline-formula id="ieqn-2362"><mml:math id="mml-ieqn-2362"><mml:mi>k</mml:mi></mml:math></inline-formula> snapshots a few (<inline-formula id="ieqn-2363"><mml:math id="mml-ieqn-2363"><mml:mi>m</mml:mi><mml:mo>&#x226A;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>) dominant POD modes <inline-formula id="ieqn-2364"><mml:math id="mml-ieqn-2364"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2365"><mml:math id="mml-ieqn-2365"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula> (with the highest energies / eigenvalues) and their corresponding coefficients <inline-formula id="ieqn-2366"><mml:math id="mml-ieqn-2366"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-2367"><mml:math id="mml-ieqn-2367"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, by solving the eigenvalue problem Eq. (<xref ref-type="disp-formula" rid="eqn-438">438</xref>), then use Eq. (<xref ref-type="disp-formula" rid="eqn-437">437</xref>) to obtain the POD modes <inline-formula id="ieqn-2368"><mml:math id="mml-ieqn-2368"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, and Eq. 
(<xref ref-type="disp-formula" rid="eqn-431">431</xref>) to obtain <inline-formula id="ieqn-2369"><mml:math id="mml-ieqn-2369"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for use in Step (<xref ref-type="list" rid="L2">3</xref>).</p></list-item>
<list-item><p>(3) The time series of the coefficient <inline-formula id="ieqn-2370"><mml:math id="mml-ieqn-2370"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of the dominant POD mode <inline-formula id="ieqn-2371"><mml:math id="mml-ieqn-2371"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of a <italic>training</italic> dataset is chunked into thousands of small samples with <inline-formula id="ieqn-2372"><mml:math id="mml-ieqn-2372"><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-2373"><mml:math id="mml-ieqn-2373"><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is the time of the <italic>kth</italic> snapshot, by a moving window. Each sample is subdivided into two parts: The input part with time length <inline-formula id="ieqn-2374"><mml:math id="mml-ieqn-2374"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the output part with time length <inline-formula id="ieqn-2375"><mml:math id="mml-ieqn-2375"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, Figure <xref ref-type="fig" rid="fig-118">118</xref>. 
These thousands of input/output pairs were then used to train LSTM/BiLSTM networks in Step (<xref ref-type="list" rid="L2">4</xref>). See Remark <xref ref-type="statement" rid="st12_3">12.3</xref>.</p></list-item>
<list-item><p>(4) Use the input/output pairs generated from the training datasets in Step (<xref ref-type="list" rid="L2">3</xref>) to train LSTM/BiLSTM-ROM networks. Two methods were considered in [<xref ref-type="bibr" rid="ref-26">26</xref>]:</p>
<list list-type="simple">
<list-item><label>(a)</label><p><italic>Multiple-network method:</italic> Use a separate RNN for each of the <inline-formula id="ieqn-2376"><mml:math id="mml-ieqn-2376"><mml:mi>m</mml:mi></mml:math></inline-formula> dominant POD modes to predict the corresponding coefficient <inline-formula id="ieqn-2377"><mml:math id="mml-ieqn-2377"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, for <inline-formula id="ieqn-2378"><mml:math id="mml-ieqn-2378"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, given a sample input; see Figure <xref ref-type="fig" rid="fig-119">119a</xref> and Figure <xref ref-type="fig" rid="fig-118">118</xref>. Hyperparameters (layer width, learning rate, batch size) are tuned for the most dominant POD mode and reused for training the other neural networks.</p>
</list-item>
<list-item><label>(b)</label><p><italic>Single-network method:</italic> Use the same RNN to predict the coefficients <inline-formula id="ieqn-2379"><mml:math id="mml-ieqn-2379"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-2380"><mml:math id="mml-ieqn-2380"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, of all <inline-formula id="ieqn-2381"><mml:math id="mml-ieqn-2381"><mml:mi>m</mml:mi></mml:math></inline-formula> dominant POD modes at once, given a sample input; see Figure <xref ref-type="fig" rid="fig-119">119b</xref> and Figure <xref ref-type="fig" rid="fig-118">118</xref>.</p>
</list-item></list>
<p>The single-network method better captures the inter-modal interactions that describe the energy transfer from larger to smaller scales. Vortices that spread over multiple dominant POD modes also support the single-network method, which does not artificially constrain flow features to separate POD modes.</p></list-item>
<list-item><p>(5) Validation: Input/output pairs similar to those for training in Step (<xref ref-type="list" rid="L2">3</xref>) were generated from the <italic>test</italic> dataset for validation.<xref ref-type="fn" rid="fn279"><sup>279</sup></xref><fn id="fn279"><label>279</label><p>The authors of [<xref ref-type="bibr" rid="ref-26">26</xref>] did not have a validation dataset as defined in Section <xref ref-type="sec" rid="s6_1">6.1</xref>.</p></fn> With a short time series of the coefficient <inline-formula id="ieqn-2382"><mml:math id="mml-ieqn-2382"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of the dominant POD mode <inline-formula id="ieqn-2383"><mml:math id="mml-ieqn-2383"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of the <italic>test</italic> dataset, with <inline-formula id="ieqn-2384"><mml:math id="mml-ieqn-2384"><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-2385"><mml:math id="mml-ieqn-2385"><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is the time of the <italic>kth</italic> snapshot, and <inline-formula id="ieqn-2386"><mml:math id="mml-ieqn-2386"><mml:mn>0</mml:mn><mml:mo>&lt;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as 
input, use the trained LSTM/BiLSTM networks in Step (<xref ref-type="list" rid="L2">4</xref>) to predict the values of <inline-formula id="ieqn-2387"><mml:math id="mml-ieqn-2387"><mml:msub><mml:mover accent='true'><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-2388"><mml:math id="mml-ieqn-2388"><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-2389"><mml:math id="mml-ieqn-2389"><mml:msubsup><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the time at the end of the sample, such that the sample length is <inline-formula id="ieqn-2390"><mml:math id="mml-ieqn-2390"><mml:msubsup><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-2391"><mml:math 
id="mml-ieqn-2391"><mml:mn>0</mml:mn><mml:mo>&lt;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>; see Figure <xref ref-type="fig" rid="fig-118">118</xref>. Compute the error between the predicted value <inline-formula id="ieqn-2392"><mml:math id="mml-ieqn-2392"><mml:msub><mml:mover accent='true'><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and the target value of <inline-formula id="ieqn-2393"><mml:math id="mml-ieqn-2393"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> from the test dataset. Repeat the same prediction procedure for all <inline-formula id="ieqn-2394"><mml:math id="mml-ieqn-2394"><mml:mi>m</mml:mi></mml:math></inline-formula> dominant POD modes chosen in Step (<xref ref-type="list" rid="L2">2</xref>). See Remark <xref ref-type="statement" rid="st12_3">12.3</xref>.</p></list-item>
<list-item><p>(6) At time <inline-formula id="ieqn-2395"><mml:math id="mml-ieqn-2395"><mml:mi>t</mml:mi></mml:math></inline-formula>, use the predicted coefficients <inline-formula id="ieqn-2396"><mml:math id="mml-ieqn-2396"><mml:msub><mml:mover accent='true'><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> together with the dominant POD modes <inline-formula id="ieqn-2397"><mml:math id="mml-ieqn-2397"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, for <inline-formula id="ieqn-2398"><mml:math id="mml-ieqn-2398"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, to compute the flow field dynamics <inline-formula id="ieqn-2399"><mml:math id="mml-ieqn-2399"><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> using Eq. (<xref ref-type="disp-formula" rid="eqn-4">4</xref>).</p></list-item></list>
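<p>As a minimal sketch of Steps (5) and (6), the following NumPy snippet splits a coefficient time series into input/output windows and then reconstructs a field from predicted coefficients and POD modes. The function names <monospace>make_windows</monospace> and <monospace>reconstruct</monospace>, the array layouts, and the stride of one step between samples are illustrative assumptions, not taken from [<xref ref-type="bibr" rid="ref-26">26</xref>]; the reconstruction further assumes that Eq. (<xref ref-type="disp-formula" rid="eqn-4">4</xref>) is a plain truncated modal sum without a mean-flow offset.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def make_windows(alpha, n_inp, n_out):
    """Split a 1-D coefficient series alpha_i(t_k) into input/output pairs.

    Each sample spans n_inp + n_out consecutive steps: the first n_inp
    values are the network input, the remaining n_out values the target.
    A stride of one step between samples is an assumption of this sketch.
    """
    alpha = np.asarray(alpha, dtype=float)
    n_spl = n_inp + n_out
    starts = range(len(alpha) - n_spl + 1)
    inputs = np.array([alpha[k:k + n_inp] for k in starts])
    targets = np.array([alpha[k + n_inp:k + n_spl] for k in starts])
    return inputs, targets


def reconstruct(alpha_pred, modes):
    """Evaluate u(x, t) = sum_i alpha_i(t) * phi_i(x).

    alpha_pred: array (n_t, m) of predicted coefficients,
    modes:      array (m, n_x) whose rows are the POD modes phi_i(x).
    Returns an array (n_t, n_x) with one reconstructed field per time.
    """
    return np.asarray(alpha_pred) @ np.asarray(modes)
```
]]></code>
<p>With 10 time steps for both input and output, as reported in Remark <xref ref-type="statement" rid="st12_3">12.3</xref>, one would call <monospace>make_windows(alpha, 10, 10)</monospace> on the coefficient series of each retained POD mode, and pass the stacked network predictions to <monospace>reconstruct</monospace>.</p>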
</sec>
</sec>
<sec id="s12_3"><label>12.3</label>
<title>Memory effects of POD coefficients on LSTM models</title>
<p>The results and deep-learning concepts used in [<xref ref-type="bibr" rid="ref-26">26</xref>] were presented in the motivational Section <xref ref-type="sec" rid="s2_3_3">2.3.3</xref> above (see Figure <xref ref-type="fig" rid="fig-20">20</xref>). In this section, we discuss some details of the formulation.</p> 
<statement id="st12_3"><title>Remark 12.3.</title>
<p>Even though the value of <inline-formula id="ieqn-2400"><mml:math id="mml-ieqn-2400"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn><mml:mspace width="thinmathspace" /><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:math></inline-formula> was given in [<xref ref-type="bibr" rid="ref-26">26</xref>] as an example, all LSTM/BiLSTM networks in their numerical examples were trained using input/output pairs with <inline-formula id="ieqn-2401"><mml:math id="mml-ieqn-2401"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>10</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula>, i.e., 10 time steps for both input and output samples. 
With the overall simulation time for both physical problems (ISO and MHD) being <inline-formula id="ieqn-2402"><mml:math id="mml-ieqn-2402"><mml:mn>2.056</mml:mn><mml:mspace width="thinmathspace" /><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>s</mml:mi></mml:math></inline-formula>, the time step size is <inline-formula id="ieqn-2403"><mml:math id="mml-ieqn-2403"><mml:mo>&#x0394;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mrow><mml:mtext>ISO</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>2.056</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>5023</mml:mn><mml:mo>=</mml:mo><mml:mn>4.1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:math></inline-formula> for ISO, and <inline-formula id="ieqn-2404"><mml:math id="mml-ieqn-2404"><mml:mo>&#x0394;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mrow><mml:mtext>MHD</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>2.056</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>1024</mml:mn><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>5</mml:mn><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mrow><mml:mtext>ISO</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> for MHD. See also Remark <xref ref-type="statement" rid="st12_1">12.1</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>The &#x201C;U-velocity field for all results&#x201D; was mentioned in [<xref ref-type="bibr" rid="ref-26">26</xref>], but without a definition of &#x201C;U-velocity&#x201D;, which was possibly the <inline-formula id="ieqn-2405"><mml:math id="mml-ieqn-2405"><mml:mi>x</mml:mi></mml:math></inline-formula> component of the 2-D velocity field in the 2-D planes (slices) used for the datasets, with &#x201C;V-velocity&#x201D; being the corresponding <inline-formula id="ieqn-2406"><mml:math id="mml-ieqn-2406"><mml:mi>y</mml:mi></mml:math></inline-formula> component.</p>
<p>The BiLSTM networks in the numerical examples were not as accurate as the LSTM networks for both physical problems (ISO and MHD), despite requiring more computation; see Figure <xref ref-type="fig" rid="fig-117">117</xref> above and Figure <xref ref-type="fig" rid="fig-20">20</xref> in Section <xref ref-type="sec" rid="s2_3_3">2.3.3</xref>. The authors of [<xref ref-type="bibr" rid="ref-26">26</xref>] conjectured that one reason could be the random nature of turbulent flows, as opposed to the high long-term correlation found in natural human languages, which BiLSTM was designed to address.</p>
<p>Since the LSTM architecture was designed specifically for sequential data with memory, the authors of [<xref ref-type="bibr" rid="ref-26">26</xref>] sought to determine whether there was &#x201C;memory&#x201D; (or persistence) in the time series of the coefficients <inline-formula id="ieqn-2407"><mml:math id="mml-ieqn-2407"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of the POD mode <inline-formula id="ieqn-2408"><mml:math id="mml-ieqn-2408"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. To this end, the Hurst exponent<xref ref-type="fn" rid="fn280"><sup>280</sup></xref><fn id="fn280"><label>280</label><p>&#x201C;Hurst exponent&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Hurst_exponent&amp;oldid=986283721">version 22:09, 30 October 2020</ext-link>.</p></fn> was used to quantify the presence or absence of long-term memory in the time series <inline-formula id="ieqn-2409"><mml:math id="mml-ieqn-2409"><mml:msub><mml:mrow><mml:mi mathvariant="bold">&#x1D4AE;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-441"><label>(441)</label><mml:math id="mml-eqn-441" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi mathvariant="bold">&#x1D4AE;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>:=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:munder><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mtext>fixed</mml:mtext></mml:mrow></mml:mrow></mml:munder><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mi>n</mml:mi><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#xA0;</mml:mtext><mml:msub><mml:mi>R</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold">&#x1D4AE;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold">&#x1D4AE;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mtext>stdev</mml:mtext></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold">&#x1D4AE;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2410"><mml:math id="mml-ieqn-2410"><mml:msub><mml:mrow><mml:mi mathvariant="bold">&#x1D4AE;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a sequence of <inline-formula id="ieqn-2411"><mml:math id="mml-ieqn-2411"><mml:mi>n</mml:mi></mml:math></inline-formula> steps of <inline-formula id="ieqn-2412"><mml:math id="mml-ieqn-2412"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, starting at snapshot time <inline-formula id="ieqn-2413"><mml:math id="mml-ieqn-2413"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2414"><mml:math id="mml-ieqn-2414"><mml:mi mathvariant="double-struck">E</mml:mi></mml:math></inline-formula> is the expectation of the ratio of range <inline-formula id="ieqn-2415"><mml:math id="mml-ieqn-2415"><mml:msub><mml:mi>R</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> over standard deviation <inline-formula id="ieqn-2416"><mml:math id="mml-ieqn-2416"><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> for many samples <inline-formula id="ieqn-2417"><mml:math id="mml-ieqn-2417"><mml:msub><mml:mrow><mml:mi mathvariant="bold">&#x1D4AE;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with different values of <inline-formula id="ieqn-2418"><mml:math id="mml-ieqn-2418"><mml:mi>k</mml:mi></mml:math></inline-formula> keeping <inline-formula id="ieqn-2419"><mml:math id="mml-ieqn-2419"><mml:mo 
stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> fixed, <inline-formula id="ieqn-2420"><mml:math id="mml-ieqn-2420"><mml:msub><mml:mi>R</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the range of sequence <inline-formula id="ieqn-2421"><mml:math id="mml-ieqn-2421"><mml:msub><mml:mrow><mml:mi mathvariant="bold">&#x1D4AE;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-2422"><mml:math id="mml-ieqn-2422"><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the standard deviation of sequence <inline-formula id="ieqn-2423"><mml:math id="mml-ieqn-2423"><mml:msub><mml:mrow><mml:mi mathvariant="bold">&#x1D4AE;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-2424"><mml:math id="mml-ieqn-2424"><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> a constant, <inline-formula id="ieqn-2425"><mml:math id="mml-ieqn-2425"><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> the Hurst exponent for POD mode <inline-formula id="ieqn-2426"><mml:math id="mml-ieqn-2426"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
<p>A Hurst exponent close to <inline-formula id="ieqn-2427"><mml:math id="mml-ieqn-2427"><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> indicates persistent behavior, i.e., an upward trend in a sequence is followed by an upward trend. A value close to <inline-formula id="ieqn-2428"><mml:math id="mml-ieqn-2428"><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> indicates anti-persistent behavior in the time series, i.e., an upward trend is followed by a downward trend (and vice versa). The case <inline-formula id="ieqn-2429"><mml:math id="mml-ieqn-2429"><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula> indicates random behavior, which implies a lack of memory in the underlying process.</p>
<p>The effects of prediction horizon and persistence on the prediction accuracy of LSTM networks were studied in [<xref ref-type="bibr" rid="ref-26">26</xref>]. The horizon is the number of steps after the input sample for which an LSTM model predicts the values of <inline-formula id="ieqn-2430"><mml:math id="mml-ieqn-2430"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>; it is proportional to the output time length <inline-formula id="ieqn-2431"><mml:math id="mml-ieqn-2431"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, assuming a constant time step size. Persistence refers to the amount of correlation among subsequent realizations within sequential data, i.e., the presence of long-term memory.</p>
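<p>The estimation of a Hurst exponent from a time series can be sketched numerically as follows. This snippet uses the classical rescaled-range (R/S) procedure described in the Wikipedia reference cited above, in which the range is taken over the cumulative sum of mean-adjusted values rather than over the raw sequence, and in which the exponent is the slope of the logarithm of the averaged ratio versus the logarithm of the window length. Whether [<xref ref-type="bibr" rid="ref-26">26</xref>] used exactly this variant is not known to us; the function names, the non-overlapping windows, and the choice of window sizes are assumptions of this sketch.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def rescaled_range(seq):
    """Return R/S for one window: range of the cumulative mean-adjusted
    series divided by the standard deviation of the window."""
    dev = seq - seq.mean()
    cum = np.cumsum(dev)
    r = cum.max() - cum.min()
    s = seq.std()
    return r / s if s > 0 else np.nan


def hurst_exponent(series, window_sizes):
    """Estimate H as the slope of log(mean R/S) versus log(n).

    For each window length n, the series is cut into non-overlapping
    windows (different start indices k), and R/S is averaged over them.
    """
    series = np.asarray(series, dtype=float)
    log_n, log_rs = [], []
    for n in window_sizes:
        starts = range(0, len(series) - n + 1, n)
        ratios = [rescaled_range(series[k:k + n]) for k in starts]
        log_n.append(np.log(n))
        log_rs.append(np.log(np.nanmean(ratios)))
    slope, _intercept = np.polyfit(log_n, log_rs, 1)
    return slope
```
]]></code>
<p>For uncorrelated noise, such an estimate is expected to lie near 0.5 (finite-sample bias typically pushes it somewhat higher), whereas a strongly persistent series such as a random walk yields a value near 1.</p>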
<p>To this end, they selected one dataset (training or testing), and followed the multiple-network method in Step (<xref ref-type="list" rid="L2">4</xref>) of Section <xref ref-type="sec" rid="s12_2_2">12.2.2</xref> to develop a different LSTM network model for each POD mode with &#x201C;non-negligible eigenvalue&#x201D;. For both the ISO (Figure <xref ref-type="fig" rid="fig-120">120</xref>) and MHD problems, the 800 highest ranked POD modes were used.</p>
<p>A baseline horizon of 10 steps was used, for which the prediction errors were <inline-formula id="ieqn-2432"></inline-formula> <inline-formula id="ieqn-2433"><mml:math id="mml-ieqn-2433"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1.08</mml:mn><mml:mo>,</mml:mo><mml:mn>1.37</mml:mn><mml:mo>,</mml:mo><mml:mn>1.94</mml:mn><mml:mo>,</mml:mo><mml:mn>2.36</mml:mn><mml:mo>,</mml:mo><mml:mn>2.03</mml:mn><mml:mo>,</mml:mo><mml:mn>2.00</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> for POD ranks <inline-formula id="ieqn-2434"><mml:math id="mml-ieqn-2434"><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>7</mml:mn><mml:mo>,</mml:mo><mml:mn>15</mml:mn><mml:mo>,</mml:mo><mml:mn>50</mml:mn><mml:mo>,</mml:mo><mml:mn>100</mml:mn><mml:mo>,</mml:mo><mml:mn>400</mml:mn><mml:mo>,</mml:mo><mml:mn>800</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, and for Hurst exponents <inline-formula id="ieqn-2436"><mml:math id="mml-ieqn-2436"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>0.78</mml:mn><mml:mo>,</mml:mo><mml:mn>0.76</mml:mn><mml:mo>,</mml:mo><mml:mn>0.652</mml:mn><mml:mo>,</mml:mo><mml:mn>0.653</mml:mn><mml:mo>,</mml:mo><mml:mn>0.56</mml:mn><mml:mo>,</mml:mo><mml:mn>0.54</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, respectively. So the prediction error increased from 1.08 for POD rank 7 to 2.36 for POD rank 100, then decreased slightly for POD ranks 400 and 800. 
The time histories of the corresponding coefficients <inline-formula id="ieqn-2437"><mml:math id="mml-ieqn-2437"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-2438"><mml:math id="mml-ieqn-2438"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow></mml:math></inline-formula> on the right of Figure <xref ref-type="fig" rid="fig-120">120</xref> provided only some qualitative comparison between the predicted values and the true values, but did not provide the scale of the actual magnitude of <inline-formula id="ieqn-2439"><mml:math id="mml-ieqn-2439"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, nor the time interval of these plots. For example, the magnitude of <inline-formula id="ieqn-2440"><mml:math id="mml-ieqn-2440"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mn>800</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> could be very small compared to that of <inline-formula id="ieqn-2441"><mml:math id="mml-ieqn-2441"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mn>7</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, given that the POD modes were normalized in the sense of Eq. (<xref ref-type="disp-formula" rid="eqn-431">431</xref>). Qualitatively, the predicted values compared well with the true values for POD ranks 7, 15, 50. 
So even though the predicted values diverged from the true values for POD ranks 100 and 800, this divergence would be of less concern if the magnitudes of <inline-formula id="ieqn-2442"><mml:math id="mml-ieqn-2442"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mn>100</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-2443"><mml:math id="mml-ieqn-2443"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mn>800</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> were very small compared to those of the dominant POD modes.</p>
<p>Another expected result was that, for any given POD mode of rank lower than 50, the error increased dramatically with the prediction horizon. For example, for POD rank 7, the errors were <inline-formula id="ieqn-2445"><mml:math id="mml-ieqn-2445"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1.08</mml:mn><mml:mo>,</mml:mo><mml:mn>6.86</mml:mn><mml:mo>,</mml:mo><mml:mn>15.03</mml:mn><mml:mo>,</mml:mo><mml:mn>19.42</mml:mn><mml:mo>,</mml:mo><mml:mn>21.40</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> for prediction horizons of <inline-formula id="ieqn-2446"><mml:math id="mml-ieqn-2446"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>10</mml:mn><mml:mo>,</mml:mo><mml:mn>25</mml:mn><mml:mo>,</mml:mo><mml:mn>50</mml:mn><mml:mo>,</mml:mo><mml:mn>75</mml:mn><mml:mo>,</mml:mo><mml:mn>100</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> steps, respectively. Thus the error (at 21.40) for POD rank 7 with horizon 100 steps was more than ten times higher than the error (at 2.00) for POD rank 800 with horizon 10 steps. For POD mode rank 50 and higher, the error did not increase as much with the horizon, but stayed at about the same order of magnitude as the error for horizon 25 steps.</p>
<p>A final question is whether the trained LSTM-ROM networks above could produce accurate predictions for flow dynamics with different parameters, such as Reynolds number, mass density, viscosity, geometry, initial conditions, etc., particularly since both the ISO and MHD datasets were created for a single Reynolds number, the value of which was not mentioned in [<xref ref-type="bibr" rid="ref-26">26</xref>]. The authors of [<xref ref-type="bibr" rid="ref-26">26</xref>] did mention that their method (and POD in general) would work for a &#x201C;narrow range of Reynolds numbers&#x201D; for which the flow dynamics is qualitatively similar, and for &#x201C;simplified flow fields and geometries&#x201D;.</p> 
<statement id="st12_4"><title>Remark 12.4.</title>
<p><italic>Use of POD-ROM for different systems</italic>. The authors of [<xref ref-type="bibr" rid="ref-306">306</xref>] studied the flexibility of POD reduced-order models to solve nonlinear electromagnetic problems by varying the excitation form (e.g., square wave instead of sine wave) and by using the undamped (without the first-order time derivative term) snapshots in the simulation of the damped case (with the first-order time derivative term). They demonstrated via numerical examples involving nonlinear power-magnetic-component simulations that the reduced-order models by POD are quite flexible and robust. See also Remark <xref ref-type="statement" rid="st12_2">12.2</xref>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement> 
<statement id="st12_5"><title>Remark 12.5.</title>
<p><italic>pyMOR - Model Order Reduction with Python</italic>. Finally, we mention the software pyMOR,<xref ref-type="fn" rid="fn281"><sup>281</sup></xref><fn id="fn281"><label>281</label><p>See pyMOR website <ext-link ext-link-type="uri" xlink:href="https://pymor.org/">https://pymor.org/</ext-link>.</p></fn> which is &#x201C;a software library for building model order reduction applications with the Python programming language. Implemented algorithms include reduced basis methods for parametric linear and non-linear problems, as well as system-theoretic methods such as balanced truncation or IRKA (Iterative Rational Krylov Algorithm). All algorithms in pyMOR are formulated in terms of abstract interfaces for seamless integration with external PDE (Partial Differential Equation) solver packages. Moreover, pure Python implementations of FEM (Finite Element Method) and FVM (Finite Volume Method) discretizations using the NumPy/SciPy scientific computing stack are provided for getting started quickly.&#x201D; It is noted that pyMOR includes POD and &#x201C;Model order reduction with artificial neural networks&#x201D;, among many other methods; see the documentation of pyMOR. Clearly, this software tool would be applicable to many physical problems, e.g., solids, structures, fluids, electromagnetics, coupled electro-thermal simulation, etc.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<fig id="fig-121">
<label>Figure 121</label>
<caption><title><italic>Space-time solution of inviscid 1D-Burgers&#x2019; equation</italic> (Section <xref ref-type="sec" rid="s12_4_1">12.4.1</xref>). The solution shows a characteristic steep spatial gradient, which shifts and further steepens in the course of time. The FOM solution (left) and the solution of the proposed hyper-reduced ROM (center), in which the solution subspace is represented by a nonlinear manifold in the form of a feedforward neural network (Section <xref ref-type="sec" rid="s4">4</xref>) (NM-LSPG-HR), show excellent agreement, whereas the spatial gradient is significantly blurred in the solution obtained with a hyper-reduced ROM based on a linear subspace (LS-LSPG-HR) (right). The FOM is obtained from a finite-difference approximation in the spatial domain with <inline-formula id="ieqn-2007"><mml:math id="mml-ieqn-2007"><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>1001</mml:mn></mml:math></inline-formula> grid points (i.e., degrees of freedom); the backward Euler scheme is employed for time-integration using a step size <inline-formula id="ieqn-2008"><mml:math id="mml-ieqn-2008"><mml:mo>&#x0394;</mml:mo><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. Both ROMs only use <inline-formula id="ieqn-2009"><mml:math id="mml-ieqn-2009"><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula> generalized coordinates. The NM-LSPG-HR achieves a maximum relative error of less than 1 %, while the maximum relative error of the LS-LSPG-HR is approximately 6 %. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-121.tif"/>
</fig>
</sec>
<sec id="s12_4"><label>12.4</label>
<title>Reduced order models and hyper-reduction</title>
<p>Reduction of computational expense is also the main aim of <italic>nonlinear manifold reduced-order models</italic> (NM-ROMs), which were recently proposed in [<xref ref-type="bibr" rid="ref-47">47</xref>]<xref ref-type="fn" rid="fn282"><sup>282</sup></xref><fn id="fn282"><label>282</label><p>Note that the authors also published a second, condensed but otherwise similar version of their article, see [<xref ref-type="bibr" rid="ref-48">48</xref>].</p></fn> and belong to the class of <italic>projection-based methods</italic>. Projection-based methods rely on the idea that solutions to physical simulations lie in a subspace whose dimensionality is small compared to that of high-fidelity models, which we obtain upon discretization (e.g., by finite elements) of the governing equations. In classical projection-based methods, such <italic>&#x201C;intrinsic solution subspaces&#x201D;</italic> are spanned by a set of appropriate basis vectors that capture the essential features of the full-order model (FOM), i.e., the subspace is assumed to be linear. We refer to [<xref ref-type="bibr" rid="ref-307">307</xref>] for a survey on projection-based linear subspace methods for parametric systems.<xref ref-type="fn" rid="fn283"><sup>283</sup></xref><fn id="fn283"><label>283</label><p>As opposed to the classical <italic>non-parametric</italic> case, in which all parameters are fixed, <italic>parametric</italic> methods aim at creating ROMs that allow (certain) parameters of the underlying governing equations to vary within a given range. Optimization of large-scale systems, for which repeated evaluations are computationally intractable, is a classical use case for parametric model order reduction (MOR), see Benner <italic>et al</italic>. [<xref ref-type="bibr" rid="ref-307">307</xref>].</p></fn></p>
<sec id="s12_4_1"><label>12.4.1</label>
<title>Motivating example: 1D Burgers&#x2019; equation</title>
<p>The effectiveness of linear subspace methods is directly related to the dimensionality of the basis required to represent solutions with sufficient accuracy. Advection-dominated problems and problems with solutions that exhibit large (&#x201C;sharp&#x201D;) gradients, however, are characterized by a large Kolmogorov <inline-formula id="ieqn-2447"><mml:math id="mml-ieqn-2447"><mml:mi>n</mml:mi></mml:math></inline-formula>-width,<xref ref-type="fn" rid="fn284"><sup>284</sup></xref><fn id="fn284"><label>284</label><p>Mathematically, the dimensionality of the linear subspace that &#x2018;best&#x2019; approximates a nonlinear manifold is described by the <italic>Kolmogorov <inline-formula id="ieqn-3323"><mml:math id="mml-ieqn-3323"><mml:mi>n</mml:mi></mml:math></inline-formula>-width</italic>, see, e.g., [<xref ref-type="bibr" rid="ref-308">308</xref>] for a formal definition.</p></fn> which is adverse to linear subspace methods. As examples, the authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] mention hyperbolic equations with large Reynolds number, Boltzmann transport equations and traffic flow simulations. Many approaches to constructing efficient ROMs for such adverse problems are based on the idea of enhancing the &#x201C;solution representability&#x201D; of the linear subspace, e.g., by using adaptive schemes tailored to particular problems. Such problem-specific approaches, however, suffer from limited generality and require a priori knowledge, e.g., of the (spatial) direction of advection. In view of these drawbacks, the transition to a solution representation by <italic>nonlinear manifolds</italic> rather than linear subspaces in projection-based ROMs was advocated in [<xref ref-type="bibr" rid="ref-47">47</xref>].</p>
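<p>The adverse effect of a moving steep gradient on linear subspaces can be illustrated with a minimal sketch. It is not taken from [<xref ref-type="bibr" rid="ref-47">47</xref>]: the two synthetic snapshot sets, the helper <monospace>modes_for_energy</monospace> and the 99.9% energy threshold are assumptions made for illustration, and the decay of singular values only serves as a practical proxy for the Kolmogorov n-width.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def modes_for_energy(snapshots, energy=0.999):
    """Smallest number of SVD modes that captures the given energy fraction."""
    sigma = np.linalg.svd(snapshots, compute_uv=False)
    cumulative = np.cumsum(sigma**2) / np.sum(sigma**2)
    n_modes = int(np.searchsorted(cumulative, energy)) + 1
    return min(n_modes, sigma.size)


x = np.linspace(0.0, 2.0, 400)   # spatial grid (rows of the snapshot matrix)
t = np.linspace(0.0, 1.0, 100)   # time instants (columns of the snapshot matrix)

# Hypothetical advection-dominated data: a steep front travelling to the right.
front = 0.5 * (1.0 - np.tanh((x[:, None] - 0.5 - t[None, :]) / 0.02))

# Hypothetical separable data: a fixed bump whose amplitude grows in time.
bump = np.exp(-((x[:, None] - 1.0) / 0.2) ** 2) * (1.0 + t[None, :])

n_front = modes_for_energy(front)
n_bump = modes_for_energy(bump)
print("modes needed (travelling front):", n_front)
print("modes needed (stationary bump): ", n_bump)
```
]]></code>
<p>The stationary bump is a rank-one data set and therefore needs a single basis vector, whereas the travelling front requires many more, which is the behavior that motivates nonlinear manifolds.</p>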
<p>Burgers&#x2019; equation serves as a common prototype problem in numerical methods for nonlinear partial differential equations (PDEs) and in model order reduction (MOR), in particular. The inviscid Burgers&#x2019; equation in one spatial dimension is given by</p>
<p><disp-formula id="eqn-442"><label>(442)</label><mml:math id="mml-eqn-442" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mi>x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Burgers&#x2019; equation is a first-order hyperbolic equation which admits the formation of shock waves, i.e., regions with steep gradients in field variables, which propagate in the domain of interest. Its left-hand side corresponds to material time-derivatives in Eulerian descriptions of continuum mechanics. In the balance of linear momentum, for instance, the field <inline-formula id="ieqn-2448"><mml:math id="mml-ieqn-2448"><mml:mi>u</mml:mi></mml:math></inline-formula> represents the velocity field. Periodic boundary conditions and non-homogeneous initial conditions were assumed [<xref ref-type="bibr" rid="ref-47">47</xref>]:</p>
<p><disp-formula id="eqn-443"><label>(443)</label><mml:math id="mml-eqn-443" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mfrac><mml:mi>&#x03BC;</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mi>&#x03C0;</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mn>0</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" 
symmetric="true"></mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The above initial conditions are governed by a scalar parameter <inline-formula id="ieqn-2449"><mml:math id="mml-ieqn-2449"><mml:mi>&#x00B5;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0.9</mml:mn><mml:mo>,</mml:mo><mml:mn>1.1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. In parametric MOR, a ROM is meant to be valid not just for a single value of the parameter <inline-formula id="ieqn-2450"><mml:math id="mml-ieqn-2450"><mml:mi>&#x00B5;</mml:mi></mml:math></inline-formula>, but for a range of values of <inline-formula id="ieqn-2451"><mml:math id="mml-ieqn-2451"><mml:mi>&#x00B5;</mml:mi></mml:math></inline-formula> in the domain <inline-formula id="ieqn-2452"><mml:math id="mml-ieqn-2452"><mml:mi>&#x1D49F;</mml:mi></mml:math></inline-formula>. For this purpose, the reduced-order space, irrespective of whether a linear subspace of the FOM or a nonlinear manifold is used, is typically constructed using data obtained for different values of the parameter <inline-formula id="ieqn-2453"><mml:math id="mml-ieqn-2453"><mml:mi>&#x00B5;</mml:mi></mml:math></inline-formula>. In the example in [<xref ref-type="bibr" rid="ref-47">47</xref>], the solutions for the parameter set <inline-formula id="ieqn-2454"><mml:math id="mml-ieqn-2454"><mml:mi>&#x00B5;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mn>0.9</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mn>1.1</mml:mn></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> were used to construct the individual ROMs. Note that solution data corresponding to the parameter value <inline-formula id="ieqn-2455"><mml:math id="mml-ieqn-2455"><mml:mi>&#x00B5;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, for which the ROMs were evaluated, were not used in this example.</p>
<p>Figure <xref ref-type="fig" rid="fig-121">121</xref> shows a zoomed view of solutions to the above problem obtained with the FOM (left), the proposed nonlinear manifold-based ROM (NM-LSPG-HR, center) and a conventional ROM, in which the full-order solution is represented by a linear subspace (LS-LSPG-HR, right). The initial solution is characterized by a &#x201C;bump&#x201D; in the left half of the domain, which is centered at <inline-formula id="ieqn-2456"><mml:math id="mml-ieqn-2456"><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:math></inline-formula>. The advective nature of Burgers&#x2019; problem causes the bump to move to the right, whereby the slope of the bump steepens on its leading side and flattens on its trailing side. The zoomed view in Figure <xref ref-type="fig" rid="fig-121">121</xref> shows the region at the end (<inline-formula id="ieqn-2457"><mml:math id="mml-ieqn-2457"><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>) of the considered time span, in which the (negative) gradient of the solution has already steepened significantly. 
With as few as <inline-formula id="ieqn-2458"><mml:math id="mml-ieqn-2458"><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula> generalized coordinates, the proposed nonlinear manifold-based approach (NM-LSPG-HR) succeeds in reproducing the FOM solution, which is obtained by a finite-difference approximation in the spatial domain using <inline-formula id="ieqn-2459"><mml:math id="mml-ieqn-2459"><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mn>1001</mml:mn></mml:mrow></mml:math></inline-formula> grid points; time-integration is performed by means of the backward Euler scheme with a constant step size of <inline-formula id="ieqn-2460"><mml:math id="mml-ieqn-2460"><mml:mo>&#x0394;</mml:mo><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, which translates into a total of <inline-formula id="ieqn-2461"><mml:math id="mml-ieqn-2461"><mml:msub><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>500</mml:mn></mml:math></inline-formula> time steps.</p>
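<p>A full-order model of the kind described above may be sketched as follows. The text only specifies finite differences in space and the backward Euler scheme in time; the first-order upwind (backward) difference, the Newton iteration with a dense Jacobian, the treatment of periodicity (the grid point at x = 2 is identified with x = 0 and dropped) and all function names are assumptions of this sketch and need not coincide with the implementation in [<xref ref-type="bibr" rid="ref-47">47</xref>].</p>
<code language="python"><![CDATA[
```python
import numpy as np


def initial_condition(x, mu):
    """Initial condition of Eq. (443): a bump on [0, 1], constant elsewhere."""
    bump = 1.0 + 0.5 * mu * (np.sin(2.0 * np.pi * x - 0.5 * np.pi) + 1.0)
    return np.where(x <= 1.0, bump, 1.0)


def solve_fom(mu, n_points=1001, dt=1.0e-3, n_steps=500, tol=1.0e-10):
    """Backward Euler in time, upwind differences in space (assumed scheme)."""
    # Periodicity identifies x = 2 with x = 0, so the last grid point is dropped.
    x = np.linspace(0.0, 2.0, n_points)[:-1]
    c = dt / (x[1] - x[0])
    idx = np.arange(x.size)
    u = initial_condition(x, mu)
    snapshots = [u.copy()]
    for _ in range(n_steps):
        u_old = u
        u_new = u_old.copy()              # initial guess for Newton's method
        for _ in range(20):
            upwind = u_new - np.roll(u_new, 1)   # u_i - u_{i-1}, periodic
            residual = u_new - u_old + c * u_new * upwind
            if np.linalg.norm(residual) < tol:
                break
            jac = np.zeros((x.size, x.size))
            jac[idx, idx] = 1.0 + c * (upwind + u_new)   # d r_i / d u_i
            jac[idx, idx - 1] = -c * u_new               # d r_i / d u_{i-1}
            u_new = u_new - np.linalg.solve(jac, residual)
        u = u_new
        snapshots.append(u.copy())
    return x, np.array(snapshots)
```
]]></code>
<p>A call such as <monospace>solve_fom(1.0, n_points=201, dt=2.0e-3, n_steps=50)</monospace> runs quickly; with the default arguments, which mirror the grid and step size quoted in the text, the dense Jacobian makes the sketch slow, and a sparse solver would be preferable.</p>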
<p>The ROM based on a linear subspace of the full-order solution (LS-LSPG-HR) fails to accurately reproduce the steep spatial gradient that develops over time, see Figure <xref ref-type="fig" rid="fig-121">121</xref>, right. Instead, the bump is substantially blurred in the linear subspace-based ROM as compared to the FOM&#x2019;s solution (left). The maximum error over all time steps <inline-formula id="ieqn-2462"><mml:math id="mml-ieqn-2462"><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:math></inline-formula> relative to the full-order solution, defined as</p>
<p><disp-formula id="eqn-444"><label>(444)</label><mml:math id="mml-eqn-444" display="block"><mml:mrow><mml:mtext>Maximum&#x00A0;relative&#x00A0;error</mml:mtext><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mi>max</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo>&#x007B;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x007D;</mml:mo></mml:mrow></mml:munder><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2463"><mml:math id="mml-ieqn-2463"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-2464"><mml:math id="mml-ieqn-2464"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> denote FOM and ROM solution vectors, respectively, was considered in [<xref ref-type="bibr" rid="ref-47">47</xref>]. In terms of the above metric, the proposed nonlinear-manifold-based ROM achieves a maximum relative error of approximately 1%, whereas the linear-subspace-based ROM shows a maximum relative error of approximately 6%. For the given problem, the linear-subspace-based ROM is approximately 5 to 6 times faster than the FOM. The nonlinear-manifold-based ROM, however, does not achieve any speed-up unless <italic>hyper-reduction</italic> is employed. Hyper-reduction (HR) methods provide means to efficiently evaluate nonlinear terms in ROMs without evaluations of the FOM (see Hyper-reduction, Section <xref ref-type="sec" rid="s12_4_4">12.4.4</xref>). Using hyper-reduction, a speed-up by a factor of 2 is achieved for both the nonlinear-manifold and linear-subspace-based ROMs, i.e., the effective speed-ups amount to factors of 2 and 9&#x2013;10, respectively.</p>
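<p>The error metric in Eq. (<xref ref-type="disp-formula" rid="eqn-444">444</xref>) is straightforward to evaluate once FOM and ROM snapshots are available. The following minimal sketch assumes that both are stored as arrays with one row per time step; the function name and the storage layout are choices made here for illustration.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def max_relative_error(x_rom, x_fom):
    """Maximum over time steps of the relative 2-norm error, cf. Eq. (444).

    Both arrays have shape (n_t, N_s): one row per time step t_n and one
    column per degree of freedom (assumed layout).
    """
    x_rom = np.asarray(x_rom, dtype=float)
    x_fom = np.asarray(x_fom, dtype=float)
    error_norms = np.linalg.norm(x_rom - x_fom, axis=1)
    reference_norms = np.linalg.norm(x_fom, axis=1)
    return float(np.max(error_norms / reference_norms))


# Toy data: the first time step is exact, the second is off by 10% in norm.
x_fom_demo = np.array([[3.0, 4.0], [0.0, 5.0]])
x_rom_demo = np.array([[3.0, 4.0], [0.0, 5.5]])
demo_error = max_relative_error(x_rom_demo, x_fom_demo)
```
]]></code>
<p>Since Eq. (<xref ref-type="disp-formula" rid="eqn-444">444</xref>) runs over the time steps 1 to n_t, the initial state would be excluded by the caller before the arrays are passed in.</p>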
<p>The solution manifold can be represented by means of a shallow, sparsely connected feed-forward neural network [<xref ref-type="bibr" rid="ref-47">47</xref>] (see Statics, feedforward networks, Sections <xref ref-type="sec" rid="s4">4</xref>, <xref ref-type="sec" rid="s4_6">4.6</xref>). The network is trained in an unsupervised manner using the concept of autoencoders (see Autoencoder, Section <xref ref-type="sec" rid="s12_4_3">12.4.3</xref>).</p>
</sec>
<sec id="s12_4_2"><label>12.4.2</label>
<title>Nonlinear manifold-based (hyper-)reduction</title>
<p>The NM-ROM approach proposed in [<xref ref-type="bibr" rid="ref-47">47</xref>] addressed nonlinear dynamical systems, whose evolution was governed by a set of nonlinear ODEs, which had been obtained by a semi-discretization of the spatial domain, e.g., by means of finite elements:</p>
<p><disp-formula id="eqn-445"><label>(445)</label><mml:math id="mml-eqn-445" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mover><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">d</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>In the above relation, <inline-formula id="ieqn-2465"><mml:math id="mml-ieqn-2465"><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the parameterized solution of the problem, where <inline-formula id="ieqn-2466"><mml:math id="mml-ieqn-2466"><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mo>&#x2286;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> is an <inline-formula id="ieqn-2467"><mml:math id="mml-ieqn-2467"><mml:msub><mml:mi>n</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula>-dimensional vector of parameters; <inline-formula id="ieqn-2468"><mml:math id="mml-ieqn-2468"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the initial state. 
The function <inline-formula id="ieqn-2469"><mml:math id="mml-ieqn-2469"><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula> represents the rate of change of the state, which is assumed to be nonlinear in the state <inline-formula id="ieqn-2470"><mml:math id="mml-ieqn-2470"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> and possibly also in its other arguments, i.e., time <inline-formula id="ieqn-2471"><mml:math id="mml-ieqn-2471"><mml:mi>t</mml:mi></mml:math></inline-formula> and the vector of parameters <inline-formula id="ieqn-2472"><mml:math id="mml-ieqn-2472"><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi></mml:math></inline-formula>:<xref ref-type="fn" rid="fn285"><sup>285</sup></xref><fn id="fn285"><label>285</label><p>In solid mechanics, we typically deal with second-order ODEs, which can be converted into a system of first-order ODEs by including velocities in the state space. For this reason, we prefer to use the term &#x2018;rate&#x2019; rather than &#x2018;velocity&#x2019; of <inline-formula id="ieqn-3324"><mml:math id="mml-ieqn-3324"><mml:mi mathvariant='bold-italic'>x</mml:mi></mml:math></inline-formula> in what follows.</p></fn></p>
<p><disp-formula id="eqn-446"><label>(446)</label><mml:math id="mml-eqn-446" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>:</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The fundamental idea of any projection-based ROM is to approximate the original solution space of the FOM by a comparatively low-dimensional space <inline-formula id="ieqn-2473"><mml:math id="mml-ieqn-2473"><mml:mi>&#x1D4AE;</mml:mi></mml:math></inline-formula>. In view of the aforementioned shortcomings of linear subspaces, the authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] proposed a representation in terms of a <italic>nonlinear manifold</italic>, whose dimensionality was supposed to be much smaller than that of the FOM and which was described by the vector-valued function <inline-formula id="ieqn-2474"><mml:math id="mml-ieqn-2474"><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-447"><label>(447)</label><mml:math id="mml-eqn-447" display="block"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant='script'>S</mml:mi><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x007C;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>dim</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant='script'>S</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x226A;</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula></p>
<p>Using the nonlinear function <inline-formula id="ieqn-2475"><mml:math id="mml-ieqn-2475"><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula>, an approximation <inline-formula id="ieqn-2476"><mml:math id="mml-ieqn-2476"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> to the FOM&#x2019;s solution <inline-formula id="ieqn-2477"><mml:math id="mml-ieqn-2477"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> was constructed using a set of generalized coordinates <inline-formula id="ieqn-2478"><mml:math id="mml-ieqn-2478"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-448"><label>(448)</label><mml:math id="mml-eqn-448" display="block"><mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>&#x2248;</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2479"><mml:math id="mml-ieqn-2479"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denoted a (fixed) reference solution. The rate of change <inline-formula id="ieqn-2480"><mml:math id="mml-ieqn-2480"><mml:mover><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:math></inline-formula> was approximated by</p>
<p><disp-formula id="eqn-449"><label>(449)</label><mml:math id="mml-eqn-449" display="block"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mfrac><mml:mrow><mml:mtext>d</mml:mtext><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mtext>d</mml:mtext><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x2248;</mml:mo><mml:mfrac><mml:mrow><mml:mtext>d</mml:mtext><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mtext>d</mml:mtext><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mfrac><mml:mrow><mml:mtext>d</mml:mtext><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mtext>d</mml:mtext><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mfrac><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula></p>
<p>where the Jacobian <inline-formula id="ieqn-2481"><mml:math id="mml-ieqn-2481"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula> spanned the tangent space to the manifold at <inline-formula id="ieqn-2482"><mml:math id="mml-ieqn-2482"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>. The initial conditions for the generalized coordinates were given by <inline-formula id="ieqn-2483"><mml:math id="mml-ieqn-2483"><mml:msub><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-2484"><mml:math id="mml-ieqn-2484"><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> denoted the inverse function to <inline-formula id="ieqn-2485"><mml:math id="mml-ieqn-2485"><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula>.</p>
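<p>The relations in Eqs. (<xref ref-type="disp-formula" rid="eqn-448">448</xref>) and (<xref ref-type="disp-formula" rid="eqn-449">449</xref>) can be made concrete with a small decoder. The following sketch is only illustrative: the single hidden layer with sigmoid activation, the dense and randomly initialized (i.e., untrained) weights, the layer sizes and all names are assumptions, whereas [<xref ref-type="bibr" rid="ref-47">47</xref>] uses a trained, sparsely connected network.</p>
<code language="python"><![CDATA[
```python
import numpy as np

rng = np.random.default_rng(0)
N_s, n_hidden, n_s = 50, 20, 5   # illustrative sizes, not those of the paper

# Untrained, dense weights of a shallow decoder (assumption for illustration).
W1 = rng.standard_normal((n_hidden, n_s))
b1 = rng.standard_normal(n_hidden)
W2 = rng.standard_normal((N_s, n_hidden)) / np.sqrt(n_hidden)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def g(x_hat):
    """Decoder g: R^{n_s} -> R^{N_s} mapping generalized coordinates to a state."""
    return W2 @ sigmoid(W1 @ x_hat + b1)


def jacobian_g(x_hat):
    """Analytic Jacobian J_g = W2 diag(sigmoid'(z)) W1 of shape (N_s, n_s)."""
    s = sigmoid(W1 @ x_hat + b1)
    return W2 @ ((s * (1.0 - s))[:, None] * W1)


def full_state_and_rate(x_ref, x_hat, x_hat_dot):
    """Approximate state, Eq. (448), and its rate of change, Eq. (449)."""
    return x_ref + g(x_hat), jacobian_g(x_hat) @ x_hat_dot
```
]]></code>
<p>Each column of the Jacobian returned by <monospace>jacobian_g</monospace> is a tangent vector of the manifold at the current generalized coordinates; in contrast to a linear subspace, these tangent vectors change from point to point.</p>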
<p>Note that <italic>linear subspace</italic> methods are included in the above relations as the special case where <inline-formula id="ieqn-2486"><mml:math id="mml-ieqn-2486"><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula> is a linear function, which can be written in terms of a <italic>constant</italic> matrix <inline-formula id="ieqn-2487"><mml:math id="mml-ieqn-2487"><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-2488"><mml:math id="mml-ieqn-2488"><mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>. In this case, the approximation to the solution in Eq. (<xref ref-type="disp-formula" rid="eqn-448">448</xref>) and its rate in Eq. (<xref ref-type="disp-formula" rid="eqn-449">449</xref>) are given by</p>
<p><disp-formula id="eqn-450"><label>(450)</label><mml:math id="mml-eqn-450" display="block"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>&#x21D2;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>&#x2248;</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mfrac><mml:mrow><mml:mtext>d</mml:mtext><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mtext>d</mml:mtext><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x2248;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mfrac><mml:mrow><mml:mtext>d</mml:mtext><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mtext>d</mml:mtext><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mfrac><mml:mrow><mml:mtext>d</mml:mtext><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mtext>d</mml:mtext><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula></p>
<p>As opposed to the nonlinear manifold Eq. (<xref ref-type="disp-formula" rid="eqn-449">449</xref>), the tangent space of the (linear) solution manifold is constant, i.e., <inline-formula id="ieqn-2489"><mml:math id="mml-ieqn-2489"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold">&#x03A6;</mml:mi></mml:math></inline-formula>; see [<xref ref-type="bibr" rid="ref-307">307</xref>]. We also mention the example of eigenmodes as a classical choice for basis vectors of reduced subspaces [<xref ref-type="bibr" rid="ref-309">309</xref>].</p>
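<p>The linear special case can be illustrated with a minimal numerical sketch. It is not taken from the cited works: the synthetic snapshot data, the dimensions, and the choice of proper orthogonal decomposition (via a singular value decomposition) to obtain the constant matrix are assumptions made purely for illustration, and the names <monospace>Phi</monospace>, <monospace>g</monospace> and <monospace>g_inv</monospace> are hypothetical.</p>
<preformat preformat-type="code">
```python
import numpy as np

rng = np.random.default_rng(0)
N_s, n_s, n_snap = 50, 3, 20   # full dimension, reduced dimension, number of snapshots (toy values)

# Hypothetical snapshot data lying exactly in an n_s-dimensional affine subspace.
x_ref = rng.standard_normal(N_s)
modes = rng.standard_normal((N_s, n_s))
snapshots = x_ref[:, None] + modes @ rng.standard_normal((n_s, n_snap))

# Assumed basis construction: leading left singular vectors of the centered snapshots.
U, _, _ = np.linalg.svd(snapshots - x_ref[:, None], full_matrices=False)
Phi = U[:, :n_s]               # constant matrix in R^{N_s x n_s} with orthonormal columns


def g(x_hat):
    """Linear decoder g(x_hat) = Phi x_hat; its Jacobian J_g equals Phi everywhere."""
    return Phi @ x_hat


def g_inv(dx):
    """Left inverse of g on the subspace (valid because Phi has orthonormal columns)."""
    return Phi.T @ dx


x0 = snapshots[:, 0]
x_hat0 = g_inv(x0 - x_ref)     # initial generalized coordinates
x_tilde0 = x_ref + g(x_hat0)   # approximated state x_ref + Phi x_hat
reconstruction_error = np.linalg.norm(x0 - x_tilde0)
```
</preformat>
<p>Because the synthetic snapshots lie in the subspace by construction, the reconstruction error in this sketch is at the level of round-off; for data from an actual FOM, a truncation error would generally remain.</p>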
<p>The authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] defined a <italic>residual function</italic> <inline-formula id="ieqn-2490"><mml:math id="mml-ieqn-2490"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>:</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> in the reduced set of coordinates by rewriting the governing ODE Eq. (<xref ref-type="disp-formula" rid="eqn-445">445</xref>) and substituting the approximation of the state Eq. (<xref ref-type="disp-formula" rid="eqn-448">448</xref>) and its rate Eq. (<xref ref-type="disp-formula" rid="eqn-449">449</xref>):<xref ref-type="fn" rid="fn286"><sup>286</sup></xref><fn id="fn286"><label>286</label><p>The tilde above the symbol for the residual function <inline-formula id="ieqn-3325"><mml:math id="mml-ieqn-3325"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-451">451</xref>) was used to indicate an approximation, consistent with the use of the tilde in the approximation of the state in Eq. 
(<xref ref-type="disp-formula" rid="eqn-448">448</xref>), i.e., <inline-formula id="ieqn-3326"><mml:math id="mml-ieqn-3326"><mml:mi>x</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula>.</p></fn></p>
<p><disp-formula id="eqn-451"><label>(451)</label><mml:math id="mml-eqn-451" display="block"><mml:mrow><mml:mover accent='true'><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mover><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>&#x2212;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>As <inline-formula id="ieqn-2491"><mml:math id="mml-ieqn-2491"><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&gt;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula>, the system of equations <inline-formula id="ieqn-2492"><mml:math id="mml-ieqn-2492"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> is over-determined and, in general, has no exact solution. For this reason, a least-squares solution that minimized the square of the residual&#x2019;s Euclidean norm, which was denoted by <inline-formula id="ieqn-2493"><mml:math id="mml-ieqn-2493"><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, was sought instead [<xref ref-type="bibr" rid="ref-47">47</xref>]:</p>
<p><disp-formula id="eqn-452"><label>(452)</label><mml:math id="mml-eqn-452" display="block"><mml:mrow><mml:mover><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold-italic'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>argmin</mml:mtext></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:msubsup><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Requiring the derivative of the (squared) residual to vanish, we obtain the following set of equations,<xref ref-type="fn" rid="fn287"><sup>287</sup></xref><fn id="fn287"><label>287</label><p>As the authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] omitted a step-by-step derivation, we introduce it here for the sake of completeness.</p></fn></p>
<p><disp-formula id="eqn-453"><label>(453)</label><mml:math id="mml-eqn-453" display="block"><mml:mrow><mml:mfrac><mml:mo>&#x2202;</mml:mo><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mover><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow></mml:mfrac><mml:msubsup><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mover><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x00A0;</mml:mtext></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mover><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>&#x2212;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mn>0</mml:mn></mml:mstyle><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>which can be rearranged for the rate of the reduced vector of generalized coordinates as:</p>
<p><disp-formula id="eqn-454"><label>(454)</label><mml:math id="mml-eqn-454" display="block"><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi><mml:mo>&#x2020;</mml:mo></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mn>0</mml:mn></mml:msub><mml:mo 
stretchy='false'>(</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi><mml:mo>&#x2020;</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The above system of ODEs, in which <inline-formula id="ieqn-2494"><mml:math id="mml-ieqn-2494"><mml:msubsup><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi><mml:mo>&#x2020;</mml:mo></mml:msubsup></mml:math></inline-formula> denotes the <italic>Moore-Penrose inverse</italic> of the Jacobian <inline-formula id="ieqn-2495"><mml:math id="mml-ieqn-2495"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, which is also referred to as <italic>pseudo-inverse</italic>, constitutes the ROM corresponding to the FOM governed by Eq. (<xref ref-type="disp-formula" rid="eqn-445">445</xref>). Eq. (<xref ref-type="disp-formula" rid="eqn-453">453</xref>) reveals that the minimization problem is equivalent to a projection of the full residual onto the low-dimensional subspace <inline-formula id="ieqn-2496"><mml:math id="mml-ieqn-2496"><mml:mi mathvariant="bold-italic">&#x1D4AE;</mml:mi></mml:math></inline-formula> by means of the Jacobian <inline-formula id="ieqn-2497"><mml:math id="mml-ieqn-2497"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula>. Note that the rate of the generalized coordinates <inline-formula id="ieqn-2498"><mml:math id="mml-ieqn-2498"><mml:mover><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:math></inline-formula> lies in the same tangent space, which is spanned by <inline-formula id="ieqn-2499"><mml:math id="mml-ieqn-2499"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula>. 
Therefore, the authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] described the projection Eq. (<xref ref-type="disp-formula" rid="eqn-453">453</xref>) as <italic>Galerkin projection</italic> and the ROM Eq. (<xref ref-type="disp-formula" rid="eqn-454">454</xref>) as <italic>nonlinear manifold (NM) Galerkin ROM</italic>. To construct the solutions, suitable time-integration schemes needed to be applied to the semi-discrete ROM Eq. (<xref ref-type="disp-formula" rid="eqn-454">454</xref>).</p>
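<p>The relation between the least-squares problem Eq. (<xref ref-type="disp-formula" rid="eqn-452">452</xref>), the stationarity condition Eq. (<xref ref-type="disp-formula" rid="eqn-453">453</xref>) and the pseudo-inverse in Eq. (<xref ref-type="disp-formula" rid="eqn-454">454</xref>) can be checked numerically with the following sketch. The decoder (an elementwise <monospace>tanh</monospace> of a linear map), the linear right-hand side and all dimensions are hypothetical stand-ins chosen for illustration; they are not the models used in [<xref ref-type="bibr" rid="ref-47">47</xref>].</p>
<preformat preformat-type="code">
```python
import numpy as np

rng = np.random.default_rng(1)
N_s, n_s = 8, 2
W = rng.standard_normal((N_s, n_s))   # hypothetical decoder weights
A = -np.eye(N_s)                      # hypothetical linear FOM operator
x_ref = np.zeros(N_s)


def g(x_hat):
    """Toy nonlinear decoder: elementwise tanh of a linear map."""
    return np.tanh(W @ x_hat)


def jac_g(x_hat):
    """Analytical Jacobian J_g = diag(1 - tanh^2(W x_hat)) W."""
    return (1.0 - np.tanh(W @ x_hat) ** 2)[:, None] * W


def f(x, t):
    """Toy right-hand side of the FOM, dx/dt = f(x, t)."""
    return A @ x + np.sin(t)


def galerkin_rate(x_hat, t):
    """Rate of the generalized coordinates minimizing ||J_g v - f||_2 over v."""
    J = jac_g(x_hat)
    rhs = f(x_ref + g(x_hat), t)
    v, *_ = np.linalg.lstsq(J, rhs, rcond=None)
    return v, J, rhs


x_hat = np.array([0.3, -0.2])
v, J, rhs = galerkin_rate(x_hat, t=0.5)
normal_eq_residual = J.T @ (J @ v - rhs)   # stationarity condition without the factor 2
pinv_rate = np.linalg.pinv(J) @ rhs        # explicit Moore-Penrose form of the same rate
```
</preformat>
<p>In this sketch, <monospace>normal_eq_residual</monospace> vanishes up to round-off and <monospace>v</monospace> coincides with <monospace>pinv_rate</monospace>, provided the Jacobian has full column rank. Solving the least-squares problem directly is usually preferable to forming the pseudo-inverse explicitly.</p>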
<p>An alternative approach was also presented in [<xref ref-type="bibr" rid="ref-47">47</xref>] for the construction of ROMs, in which time-discretization was performed prior to the projection onto the low-dimensional solution subspace. For this purpose, a uniform time-discretization with a step size <inline-formula id="ieqn-2500"><mml:math id="mml-ieqn-2500"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula> was assumed; the solution at time <inline-formula id="ieqn-2501"><mml:math id="mml-ieqn-2501"><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula> was denoted by <inline-formula id="ieqn-2502"><mml:math id="mml-ieqn-2502"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. To develop the method for implicit integration schemes, the <italic>backward Euler</italic> method was chosen as an example [<xref ref-type="bibr" rid="ref-47">47</xref>]:</p>
<p><disp-formula id="eqn-455"><label>(455)</label><mml:math id="mml-eqn-455" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The above integration rule implies that the span of the rate <inline-formula id="ieqn-2503"><mml:math id="mml-ieqn-2503"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:math></inline-formula>, which is obtained by evaluating (i.e., by taking a &#x2018;snapshot&#x2019; of) the nonlinear function <inline-formula id="ieqn-2504"><mml:math id="mml-ieqn-2504"><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula>, is contained in the span of the states (&#x2018;solution snapshots&#x2019;) <inline-formula><mml:math><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-2505"><mml:math id="mml-ieqn-2505"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-456"><label>(456)</label><mml:math id="mml-eqn-456" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mtext>span</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x2286;</mml:mo><mml:mrow><mml:mrow><mml:mtext>span</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mspace width="2em" /><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mspace width="2em" /><mml:mrow><mml:mrow><mml:mtext>span</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x2286;</mml:mo><mml:mrow><mml:mrow><mml:mtext>span</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi 
mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
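<p>The span inclusion in Eq. (<xref ref-type="disp-formula" rid="eqn-456">456</xref>) can be verified on a small example. The sketch below assumes a hypothetical linear rate function with a diagonal operator, for which the backward Euler step reduces to a linear solve; the dimensions and the initial state are arbitrary illustrative choices.</p>
<preformat preformat-type="code">
```python
import numpy as np

N_s, N_t, dt = 6, 4, 0.1
A = -np.diag(np.arange(1.0, N_s + 1))    # hypothetical linear rate function f(x) = A x

# Backward Euler for the linear system: (I - dt A) x_n = x_{n-1}.
M = np.eye(N_s) - dt * A
X = [np.ones(N_s)]
for _ in range(N_t):
    X.append(np.linalg.solve(M, X[-1]))
X = np.stack(X, axis=1)                  # solution snapshots x_0, ..., x_{N_t} as columns
F = A @ X[:, 1:]                         # rate snapshots f_1, ..., f_{N_t} as columns

# Project the rate snapshots onto the span of the solution snapshots.
coeffs, *_ = np.linalg.lstsq(X, F, rcond=None)
span_residual = np.linalg.norm(X @ coeffs - F)
```
</preformat>
<p>The five solution snapshots span only a proper subspace of the six-dimensional state space, yet <monospace>span_residual</monospace> is at round-off level, which is the numerical counterpart of the inclusion stated above.</p>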
<p>For the backward Euler scheme Eq. (<xref ref-type="disp-formula" rid="eqn-455">455</xref>), a residual function was defined in [<xref ref-type="bibr" rid="ref-47">47</xref>] as the difference</p>
<p><disp-formula id="eqn-457"><label>(457)</label><mml:math id="mml-eqn-457" display="block"><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:mi>B</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>Just as in the time-continuous domain, the system of equations <inline-formula id="ieqn-2506"><mml:math id="mml-ieqn-2506"><mml:msubsup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>B</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mn>0</mml:mn></mml:mstyle></mml:math></inline-formula> is over-determined and, hence, was reformulated as a least-squares problem for the generalized coordinates <inline-formula id="ieqn-2507"><mml:math id="mml-ieqn-2507"><mml:msub><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-458"><label>(458)</label><mml:math id="mml-eqn-458" display="block"><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mstyle><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>arg&#x00A0;min</mml:mtext></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mtext>&#x00A0;</mml:mtext><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:msubsup><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>B</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>.</mml:mo></mml:math></disp-formula></p>
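<p>As an illustration of the time-discrete residual Eq. (<xref ref-type="disp-formula" rid="eqn-457">457</xref>) and the associated least-squares problem Eq. (<xref ref-type="disp-formula" rid="eqn-458">458</xref>), the sketch below treats a deliberately simple special case, assumed here for illustration only: a linear decoder with a random orthonormal basis and a linear rate function. Under these assumptions the residual is affine in the unknown generalized coordinates, so that a single linear least-squares solve yields the minimizer. The function names are hypothetical.</p>
<preformat preformat-type="code">
```python
import numpy as np

rng = np.random.default_rng(3)
N_s, n_s, dt = 10, 3, 0.05
Phi, _ = np.linalg.qr(rng.standard_normal((N_s, n_s)))   # hypothetical orthonormal basis
A = -np.diag(np.linspace(1.0, 2.0, N_s))                 # hypothetical linear rate f(x) = A x
x_ref = np.zeros(N_s)


def g(x_hat):
    """Linear decoder used in this toy problem."""
    return Phi @ x_hat


def residual_be(v_hat, x_hat_prev):
    """Backward Euler residual g(v) - g(x_prev) - dt f(x_ref + g(v)) for the toy problem."""
    return g(v_hat) - g(x_hat_prev) - dt * (A @ (x_ref + g(v_hat)))


def backward_euler_step(x_hat_prev):
    """Minimize 0.5 * ||residual_be(v, x_hat_prev)||^2 over v; linear least squares here."""
    lhs = Phi - dt * (A @ Phi)                  # derivative of the residual w.r.t. v
    rhs = g(x_hat_prev) + dt * (A @ x_ref)
    v, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return v, lhs


x_hat_prev = np.array([1.0, -0.5, 0.25])
x_hat_n, lhs = backward_euler_step(x_hat_prev)
gradient = lhs.T @ residual_be(x_hat_n, x_hat_prev)   # gradient of the objective at the minimizer
```
</preformat>
<p>The factor one half in the objective does not change the minimizer; it only removes the factor two from the gradient, which vanishes at <monospace>x_hat_n</monospace> up to round-off. For a nonlinear decoder or rate function, the single solve would have to be replaced by an iterative method.</p>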
<p>To solve the least-squares problem, the Gauss-Newton method with the starting point <inline-formula id="ieqn-2508"><mml:math id="mml-ieqn-2508"><mml:msub><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> was applied, i.e., the residual Eq. (<xref ref-type="disp-formula" rid="eqn-457">457</xref>) was expanded into a first-order Taylor polynomial in the generalized coordinates [<xref ref-type="bibr" rid="ref-47">47</xref>]:</p>
<p><disp-formula id="eqn-459"><label>(459)</label><mml:math id="mml-eqn-459" display="block"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msubsup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>B</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2248;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>B</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msubsup><mml:mover 
accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>B</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msubsup></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>n</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo 
stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;&#x200B;</mml:mtext><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>f</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo 
stretchy='false'>(</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>n</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>I</mml:mi></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mi>&#x0394;</mml:mi><mml:mi>t</mml:mi><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>f</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>J</mml:mi></mml:mstyle><mml:mi>g</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>n</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mstyle 
mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Unlike the time-continuous case, the Gauss-Newton method results in a projection involving not only the Jacobian <inline-formula id="ieqn-2509"><mml:math id="mml-ieqn-2509"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula> defined in Eq. (<xref ref-type="disp-formula" rid="eqn-449">449</xref>), but also the Jacobian of the nonlinear function <inline-formula id="ieqn-2510"><mml:math id="mml-ieqn-2510"><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-2511"><mml:math id="mml-ieqn-2511"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:math></inline-formula>. Therefore, the resulting reduced set of algebraic equations obtained upon a projection of the fully discrete FOM was referred to as <italic>nonlinear manifold least-squares Petrov-Galerkin (NM-LSPG)</italic> ROM [<xref ref-type="bibr" rid="ref-47">47</xref>].</p>
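<p>How the linearization in Eq. (<xref ref-type="disp-formula" rid="eqn-459">459</xref>) drives a Gauss-Newton iteration can be illustrated with a minimal sketch. It is not the implementation of [<xref ref-type="bibr" rid="ref-47">47</xref>]: the parabola-shaped decoder <monospace>g</monospace>, the linear right-hand side <monospace>f</monospace>, the zero reference state and all function names are hypothetical choices made for illustration. The residual is assumed to take the backward-Euler form <monospace>g(xh_n) - g(xh_prev) - dt f(x_ref + g(xh_n))</monospace>, which is what the linearization implies when the two reduced states coincide. With a single generalized coordinate, the least-squares update reduces to a scalar division.</p>
<code language="python">
```python
# Toy NM-LSPG step: one generalized coordinate, two full-order unknowns.
# All names and the model problem are illustrative assumptions.

A = [[-1.0, 0.0], [0.0, -2.0]]   # f(x) = A x, so J_f = A
X_REF = [0.0, 0.0]               # reference state (assumed zero here)


def g(xh):
    """Hypothetical decoder: maps the scalar latent state onto a parabola."""
    return [xh, xh * xh]


def jac_g(xh):
    """Jacobian J_g of the decoder (a 2x1 column, stored as a list)."""
    return [1.0, 2.0 * xh]


def f(x):
    """Right-hand side of the full-order model, here linear."""
    return [sum(A[i][k] * x[k] for k in range(2)) for i in range(2)]


def residual(xh_n, xh_prev, dt):
    """Backward-Euler residual g(xh_n) - g(xh_prev) - dt f(x_ref + g(xh_n))."""
    g_n, g_prev = g(xh_n), g(xh_prev)
    f_n = f([X_REF[i] + g_n[i] for i in range(2)])
    return [g_n[i] - g_prev[i] - dt * f_n[i] for i in range(2)]


def residual_jacobian(xh_n, dt):
    """(I - dt J_f) J_g, the factor multiplying the increment in Eq. (459)."""
    jg = jac_g(xh_n)
    return [jg[i] - dt * sum(A[i][k] * jg[k] for k in range(2)) for i in range(2)]


def gauss_newton(xh_prev, dt, iterations=10):
    """Minimize the squared residual norm over the latent state."""
    xh = xh_prev  # initial guess: previous state, as in the linearization
    for _ in range(iterations):
        r = residual(xh, xh_prev, dt)
        j = residual_jacobian(xh, dt)
        # Normal equations (J^T J) dx = -J^T r; scalar because n_s = 1.
        dx = -sum(j[i] * r[i] for i in range(2)) / sum(j[i] * j[i] for i in range(2))
        xh += dx
    return xh


xh_next = gauss_newton(xh_prev=1.0, dt=0.1)
```
</code>
<p>In this toy problem, the two residual components cannot vanish simultaneously, so the iteration settles on the least-squares compromise rather than on an exact root, which is the situation the Gauss-Newton projection is designed for.</p>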
</sec>
<sec id="s12_4_3"><label>12.4.3</label>
<title>Autoencoder</title>
<p>Within the NM-ROM approach, it was proposed in [<xref ref-type="bibr" rid="ref-47">47</xref>] to construct <inline-formula id="ieqn-2512"><mml:math id="mml-ieqn-2512"><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula> by training an <italic>autoencoder</italic>, i.e., a neural network that is trained to reproduce its input vector:</p>
<p><disp-formula id="eqn-460"><label>(460)</label><mml:math id="mml-eqn-460" display="block"><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold">a</mml:mi><mml:mi mathvariant="bold">e</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold">d</mml:mi><mml:mi mathvariant="bold">e</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi mathvariant="bold">e</mml:mi><mml:mi mathvariant="bold">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<fig id="fig-122">
<label>Figure 122</label>
<caption><title><italic>Dense vs. sparse decoder networks</italic> (Section <xref ref-type="sec" rid="s12_4_3">12.4.3</xref>). Contributing neurons (orange &#x201C;nodes&#x201D;) and connections (orange &#x201C;edges&#x201D;) lie in the &#x201C;active&#x201D; paths arriving at the selected outputs (solid orange &#x201C;nodes&#x201D;) from the decoder&#x2019;s inputs. In dense networks such as the one in (a), each neuron in a layer is connected to all neurons in both the preceding layer (if it exists) and the succeeding layer (if it exists). Fully-connected networks are characterized by dense weight matrices, see Section <xref ref-type="sec" rid="s4_4">4.4</xref>. In sparse networks such as the decoder in (b), several connections between successive layers are dropped, resulting in sparsely populated weight matrices. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-122.tif"/>
</fig>
<p>As the above relation reveals, autoencoders are typically composed of two parts, i.e., <inline-formula id="ieqn-2513"><mml:math id="mml-ieqn-2513"><mml:mrow><mml:mi mathvariant="bold">a</mml:mi><mml:mi mathvariant="bold">e</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi mathvariant="bold">d</mml:mi><mml:mi mathvariant="bold">e</mml:mi></mml:mrow><mml:mo>&#x2218;</mml:mo><mml:mrow><mml:mi mathvariant="bold">e</mml:mi><mml:mi mathvariant="bold">n</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>: The <italic>encoder</italic> &#x2018;codes&#x2019; inputs <inline-formula id="ieqn-2514"><mml:math id="mml-ieqn-2514"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> into a so-called <italic>latent state</italic> <inline-formula id="ieqn-2515"><mml:math id="mml-ieqn-2515"><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold">e</mml:mi><mml:mi mathvariant="bold">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (not to be confused with hidden states of neural networks). The <italic>decoder</italic> then reconstructs (&#x2018;decodes&#x2019;) an approximation of the input from the latent state, i.e., <inline-formula id="ieqn-2516"><mml:math id="mml-ieqn-2516"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold">d</mml:mi><mml:mi mathvariant="bold">e</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. 
Both encoder and decoder can be represented by neural networks, e.g., feed-forward networks as done in [<xref ref-type="bibr" rid="ref-47">47</xref>]. As the authors of [<xref ref-type="bibr" rid="ref-78">78</xref>] noted, an autoencoder is &#x201C;not especially useful&#x201D; if it exactly learns the identity mapping for all possible inputs <inline-formula id="ieqn-2517"><mml:math id="mml-ieqn-2517"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula>. Instead, autoencoders are typically restricted in some way, e.g., by reducing the dimensionality of the latent space as compared to the dimension of inputs, i.e., <inline-formula id="ieqn-2518"><mml:math id="mml-ieqn-2518"><mml:mi>dim</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&lt;</mml:mo><mml:mi>dim</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The restriction forces an autoencoder to focus on those aspects of the input, or input &#x2018;features&#x2019;, which are essential for the reconstruction of the input. The encoder network of such an <italic>undercomplete autoencoder</italic><xref ref-type="fn" rid="fn288"><sup>288</sup></xref><fn id="fn288"><label>288</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], Chapter 14, p. 493.</p></fn> performs a <italic>dimensionality reduction</italic>, which is exactly the aim of projection-based methods for constructing ROMs. Using nonlinear encoder/decoder networks, undercomplete autoencoders can be trained to represent low-dimensional subspaces of solutions to high-dimensional dynamical systems governed by Eq. 
(<xref ref-type="disp-formula" rid="eqn-445">445</xref>).<xref ref-type="fn" rid="fn289"><sup>289</sup></xref><fn id="fn289"><label>289</label><p>In fact, linear decoder networks combined with an MSE loss are equivalent to <italic>principal component analysis</italic> (PCA) (see [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 494), which, in turn, is equivalent to the discrete variant of POD by means of singular value decomposition (see Remark <xref ref-type="statement" rid="st12_2">12.2</xref>).</p></fn> In particular, the decoder network represents the nonlinear manifold <inline-formula id="ieqn-2519"><mml:math id="mml-ieqn-2519"><mml:mi>&#x1D4AE;</mml:mi></mml:math></inline-formula>, which is described by the function <inline-formula id="ieqn-2520"><mml:math id="mml-ieqn-2520"><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula> that maps the generalized coordinates of the ROM <inline-formula id="ieqn-2521"><mml:math id="mml-ieqn-2521"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula> onto the corresponding element <inline-formula id="ieqn-2522"><mml:math id="mml-ieqn-2522"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> of the FOM&#x2019;s solution space. 
Accordingly, the generalized coordinates are identified with the autoencoder&#x2019;s latent state, i.e., <inline-formula id="ieqn-2523"><mml:math id="mml-ieqn-2523"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">h</mml:mi></mml:math></inline-formula>; the encoder network represents the inverse mapping <inline-formula id="ieqn-2524"><mml:math id="mml-ieqn-2524"><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, which &#x201C;captures the most salient features&#x201D; of the FOM.</p>
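<p>The equivalence between a linear autoencoder trained with an MSE loss and PCA can be checked on a small example. The sketch below is an illustration under stated assumptions, not part of the reviewed work: the five centred two-dimensional snapshots are made-up data, the decoder weights are assumed to be tied to the encoder weights (a single unit vector <monospace>u</monospace>), and a brute-force scan over directions stands in for gradient-based training.</p>
<code language="python">
```python
import math

# Centred toy snapshots in 2-D (made-up data scattered about the line y = 2x).
POINTS = [(-2.0, -4.1), (-1.0, -1.9), (0.0, 0.0), (1.0, 2.1), (2.0, 3.9)]


def principal_direction(points):
    """Leading eigenvector of the 2x2 covariance matrix (closed form)."""
    n = len(points)
    cxx = sum(p[0] * p[0] for p in points) / n
    cyy = sum(p[1] * p[1] for p in points) / n
    cxy = sum(p[0] * p[1] for p in points) / n
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    return (math.cos(theta), math.sin(theta))


def encode(x, u):
    """Linear encoder: latent scalar h = u^T x."""
    return u[0] * x[0] + u[1] * x[1]


def decode(h, u):
    """Linear decoder with tied weights: x_tilde = u h."""
    return (u[0] * h, u[1] * h)


def reconstruction_mse(points, u):
    """Mean squared reconstruction error of the linear autoencoder."""
    total = 0.0
    for x in points:
        xt = decode(encode(x, u), u)
        total += (x[0] - xt[0]) ** 2 + (x[1] - xt[1]) ** 2
    return total / len(points)


u_pca = principal_direction(POINTS)
mse_pca = reconstruction_mse(POINTS, u_pca)

# Brute-force search over unit directions as a stand-in for training.
mse_best = min(
    reconstruction_mse(POINTS, (math.cos(0.001 * k), math.sin(0.001 * k)))
    for k in range(3142)
)
```
</code>
<p>Up to the resolution of the scan, no direction reconstructs the snapshots better than the principal direction. A nonlinear decoder is needed once the snapshots no longer cluster around a straight line.</p>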
<p>The input data, which was formed by the snapshots of solutions <inline-formula id="ieqn-2525"><mml:math id="mml-ieqn-2525"><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-2526"><mml:math id="mml-ieqn-2526"><mml:mi>m</mml:mi></mml:math></inline-formula> denoted the number of snapshots, was normalized when training the network [<xref ref-type="bibr" rid="ref-47">47</xref>]. The shift by the referential state <inline-formula id="ieqn-2527"><mml:math id="mml-ieqn-2527"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and scaling by <inline-formula id="ieqn-2528"><mml:math id="mml-ieqn-2528"><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, i.e.,</p>
<p><disp-formula id="eqn-461"><label>(461)</label><mml:math id="mml-eqn-461" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>is such that each component of the normalized input lies in the interval <inline-formula id="ieqn-2529"><mml:math id="mml-ieqn-2529"><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> or <inline-formula id="ieqn-2530"><mml:math id="mml-ieqn-2530"><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>.<xref ref-type="fn" rid="fn290"><sup>290</sup></xref><fn id="fn290"><label>290</label><p>The authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] did not provide further details.</p></fn> The encoder maps a (normalized) solution snapshot of the FOM onto a corresponding low-dimensional latent state <inline-formula id="ieqn-2531"><mml:math id="mml-ieqn-2531"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>, i.e., the vector of generalized coordinates:</p>
<p><disp-formula id="eqn-462"><label>(462)</label><mml:math id="mml-eqn-462" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold">e</mml:mi><mml:mi mathvariant="bold">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold">e</mml:mi><mml:mi mathvariant="bold">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The decoder reconstructs the high-dimensional solution <inline-formula id="ieqn-2532"><mml:math id="mml-ieqn-2532"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> from the low-dimensional state <inline-formula id="ieqn-2533"><mml:math id="mml-ieqn-2533"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula> and inverts the normalization of Eq. (<xref ref-type="disp-formula" rid="eqn-461">461</xref>):</p>
<p><disp-formula id="eqn-463"><label>(463)</label><mml:math id="mml-eqn-463" display="block"><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>g</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>d</mml:mi><mml:mi>e</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2298;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>where the operator &#x2298; denotes element-wise division. The composition of encoder and decoder gives the autoencoder, i.e.,</p>
<p><disp-formula id="eqn-464"><label>(464)</label><mml:math id="mml-eqn-464" display="block"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi mathvariant="bold">a</mml:mi><mml:mi mathvariant="bold">e</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="bold">d</mml:mi><mml:mi mathvariant="bold">e</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi mathvariant="bold">e</mml:mi><mml:mi mathvariant="bold">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2298;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>which is trained in an <italic>unsupervised</italic> way to (approximately) reproduce states <inline-formula id="ieqn-2535"><mml:math id="mml-ieqn-2535"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mi>m</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> from input snapshots <inline-formula id="ieqn-2536"><mml:math id="mml-ieqn-2536"><mml:mi mathvariant="bold-italic">X</mml:mi></mml:math></inline-formula>. To train autoencoders, the <italic>mean squared error</italic> (see <xref ref-type="sec" rid="s5_1_1">Sec. 5.1.1</xref>) is the natural choice for the loss function, i.e., <inline-formula id="ieqn-2537"><mml:math id="mml-ieqn-2537"><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mover><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msubsup><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mi>F</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:math></inline-formula>, where the Frobenius norm of matrices is used.</p>
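<p>The normalization, the composition of encoder and decoder, and the loss can be sketched in a few lines. The following is a hedged illustration rather than the procedure of [<xref ref-type="bibr" rid="ref-47">47</xref>], which leaves the normalization unspecified: min-max scaling to the unit interval is an assumption, the loss is taken as half the squared Frobenius norm of the reconstruction error, and the functions <monospace>en</monospace> and <monospace>de</monospace> are hypothetical stand-ins for trained networks with a one-dimensional latent state.</p>
<code language="python">
```python
def fit_normalization(snapshots):
    """Choose x_ref and x_scale so every component lies in [0, 1].

    Min-max scaling is an assumption; the reviewed work gives no details.
    """
    n = len(snapshots[0])
    lo = [min(x[i] for x in snapshots) for i in range(n)]
    hi = [max(x[i] for x in snapshots) for i in range(n)]
    x_ref = lo
    x_scale = [1.0 / (hi[i] - lo[i]) for i in range(n)]  # requires hi != lo
    return x_ref, x_scale


def normalize(x, x_ref, x_scale):
    """x_normal = x_scale (element-wise) times (x - x_ref)."""
    return [s * (xi - r) for xi, r, s in zip(x, x_ref, x_scale)]


def denormalize(y, x_ref, x_scale):
    """Inverse map: x = x_ref + y divided element-wise by x_scale."""
    return [r + yi / s for yi, r, s in zip(y, x_ref, x_scale)]


def autoencoder(x, en, de, x_ref, x_scale):
    """Composition: normalize, encode, decode, undo the normalization."""
    return denormalize(de(en(normalize(x, x_ref, x_scale))), x_ref, x_scale)


def loss(snapshots, reconstructions):
    """Half the squared Frobenius norm of the reconstruction error."""
    return 0.5 * sum(
        (a - b) ** 2
        for x, xt in zip(snapshots, reconstructions)
        for a, b in zip(x, xt)
    )


# Hypothetical stand-ins for trained networks: latent dimension 1.
def en(y):
    return [y[0]]


def de(h):
    return [h[0], h[0], 0.5]


X = [[1.0, 10.0, 5.0], [3.0, 30.0, 5.5], [2.0, 20.0, 6.0]]
x_ref, x_scale = fit_normalization(X)
X_tilde = [autoencoder(x, en, de, x_ref, x_scale) for x in X]
J = loss(X, X_tilde)
```
</code>
<p>Because the first two components of the made-up snapshots are perfectly correlated, the one-dimensional latent state reproduces them exactly; the loss stems from the third component alone.</p>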
<p>Both the encoder and the decoder are feed-forward neural networks with a single hidden layer. As the activation function, it was proposed in [<xref ref-type="bibr" rid="ref-47">47</xref>] to use the sigmoid function (see Figure <xref ref-type="fig" rid="fig-30">30</xref> and Section <xref ref-type="sec" rid="s13_3_1">13.3.1</xref>) or the swish function (see Figure <xref ref-type="fig" rid="fig-139">139</xref>). It remains unspecified, however, which of the two was used in the numerical examples in [<xref ref-type="bibr" rid="ref-47">47</xref>]. Though the representational capacity of neural networks increases with depth, which is what &#x2018;deep learning&#x2019; is all about, the authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] deliberately used shallow networks to minimize the (computational) complexity of computing the decoder&#x2019;s Jacobian <inline-formula id="ieqn-2538"><mml:math id="mml-ieqn-2538"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula>. To further improve computational efficiency, they proposed to use a &#x201C;masked/sparse decoder&#x201D;, in which the weight matrix that mapped the hidden state onto the output <inline-formula id="ieqn-2539"><mml:math id="mml-ieqn-2539"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> was sparse. As opposed to the fully populated weight matrix of a fully-connected network (Figure <xref ref-type="fig" rid="fig-122">122</xref> (a)), in which each component of the output vector depends on all components of the hidden state, the outputs of a sparse decoder (Figure <xref ref-type="fig" rid="fig-122">122</xref> (b)) depend only on selected components of the hidden state.</p>
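<p>The effect of masking the decoder&#x2019;s output layer can be illustrated with a small sketch. It assumes a binary mask that is multiplied element-wise with a dense weight matrix and a linear output layer; the sliding-window layout of the mask, the layer sizes, the weight values and the function names are hypothetical and are not taken from [<xref ref-type="bibr" rid="ref-47">47</xref>].</p>
<code language="python">
```python
def band_mask(n_out, n_hidden, width):
    """Binary mask S: output i sees `width` consecutive hidden units.

    The sliding-window layout is an illustrative assumption.
    """
    mask = []
    for i in range(n_out):
        start = (i * (n_hidden - width)) // max(n_out - 1, 1)
        window = range(start, start + width)
        mask.append([1 if j in window else 0 for j in range(n_hidden)])
    return mask


def masked_output_layer(S, W, b, y):
    """Output layer (S element-wise W) y + b; no activation on the output."""
    return [
        sum(S[i][j] * W[i][j] * y[j] for j in range(len(y))) + b[i]
        for i in range(len(b))
    ]


def output_jacobian(S, W):
    """Jacobian of the outputs w.r.t. the hidden state: S element-wise W."""
    return [[s * w for s, w in zip(s_row, w_row)] for s_row, w_row in zip(S, W)]


n_out, n_hidden, width = 6, 8, 3
S = band_mask(n_out, n_hidden, width)
W = [[0.1 * (i + 1) + 0.01 * j for j in range(n_hidden)] for i in range(n_out)]
b = [0.0] * n_out

y = [1.0] * n_hidden
out = masked_output_layer(S, W, b, y)

# Perturb a hidden unit that output 0 is not connected to.
y_perturbed = list(y)
y_perturbed[7] += 1.0
out_perturbed = masked_output_layer(S, W, b, y_perturbed)
```
</code>
<p>Only the outputs whose mask row contains the perturbed hidden unit change. The same sparsity carries over to the output-layer factor of the decoder&#x2019;s Jacobian, which is where the computational saving arises.</p>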
<fig id="fig-123">
<label>Figure 123</label>
<caption><title><italic>Sparsity masks</italic> (Section <xref ref-type="sec" rid="s12_4_3">12.4.3</xref>) used to realize sparse decoders in one- and two-dimensional problems. The structure of the respective binary-valued mask matrices <inline-formula id="ieqn-2010"><mml:math id="mml-ieqn-2010"><mml:mi mathvariant="bold-italic">S</mml:mi></mml:math></inline-formula> is inspired by grid-points required in the finite-difference approximation of the Laplace operator in one and two dimensions, respectively. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-123.tif"/>
</fig>
<p>Using our notation for feedforward networks and activation functions introduced in Section <xref ref-type="sec" rid="s4_4">4.4</xref>, the encoder network has the following structure:</p>
<p><disp-formula id="eqn-465"><label>(465)</label><mml:math id="mml-eqn-465" display="block"><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mstyle><mml:mrow><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mstyle><mml:mrow><mml:mtext>en</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mstyle><mml:mrow><mml:mtext>en</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi 
mathvariant='fraktur'>s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>z</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mtext>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;</mml:mtext><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>z</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mtext>&#x00A0;=&#x00A0;&#x00A0;</mml:mtext><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mstyle><mml:mrow><mml:mtext>en</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>b</mml:mi></mml:mstyle><mml:mrow><mml:mtext>en</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi 
mathvariant="bold-italic">x</mml:mi></mml:mstyle><mml:mrow><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2540"><mml:math id="mml-ieqn-2540"><mml:msubsup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula id="ieqn-2541"><mml:math id="mml-ieqn-2541"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>, are dense matrices. The masked decoder, in contrast, is characterized by sparse connections between the hidden layer and the output layer, which are realized by element-wise multiplication of a dense weight matrix <inline-formula id="ieqn-2542"><mml:math id="mml-ieqn-2542"><mml:msubsup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mi mathvariant="normal">d</mml:mi><mml:mi mathvariant="normal">e</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> with a binary-valued &#x201C;mask matrix&#x201D; <inline-formula id="ieqn-2543"><mml:math id="mml-ieqn-2543"><mml:mi mathvariant="bold-italic">S</mml:mi></mml:math></inline-formula> reflecting the connectivity between the two layers:</p>
<p><disp-formula id="eqn-466"><label>(466)</label><mml:math id="mml-eqn-466" display="block"><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>d</mml:mi><mml:mi>e</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="italic">x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">S</mml:mi></mml:mstyle><mml:mo>&#x2299;</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mstyle><mml:mrow><mml:mtext>de</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mstyle mathvariant="bold-italic"><mml:mi>y</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mstyle mathvariant="bold-italic"><mml:mi>b</mml:mi></mml:mstyle><mml:mrow><mml:mtext>de</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>y</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant='fraktur'>s</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>z</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mspace width="7em" /><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>z</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mstyle><mml:mrow><mml:mtext>de</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mstyle><mml:mrow><mml:mtext>de</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi 
mathvariant="bold-italic">y</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The structure of the sparsity mask <inline-formula id="ieqn-2544"><mml:math id="mml-ieqn-2544"><mml:mi mathvariant="bold-italic">S</mml:mi></mml:math></inline-formula>, in turn, is inspired by the pattern (&#x201C;stencil&#x201D;) of grid-points involved in a finite-difference approximation of the Laplace operator; see Figure <xref ref-type="fig" rid="fig-123">123</xref> for the one- and two-dimensional cases.</p>
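<p>The masked decoder in Eq. (<xref ref-type="disp-formula" rid="eqn-466">466</xref>) can be sketched in a few lines of NumPy. The following is a minimal illustration for the one-dimensional case and not a reference implementation: the helper names <monospace>stencil_mask</monospace> and <monospace>masked_decoder</monospace>, the banded mask with at most three non-zero entries per row, and the sigmoid used as activation function are assumptions made here for the purpose of the example.</p>
<preformat>
```python
import numpy as np

def stencil_mask(n_out, n_hidden, half_width=1):
    """Binary mask S of shape (n_out, n_hidden) with a banded pattern.

    Assumption: output node i is tied to the hidden node at the same
    relative grid position and to half_width neighbours on either side.
    """
    S = np.zeros((n_out, n_hidden))
    for i in range(n_out):
        centre = int(round(i * (n_hidden - 1) / max(n_out - 1, 1)))
        lo = max(centre - half_width, 0)
        hi = min(centre + half_width, n_hidden - 1)
        S[i, lo:hi + 1] = 1.0
    return S

def sigmoid(z):
    # Placeholder activation (assumption); any smooth activation could be used
    return 1.0 / (1.0 + np.exp(-z))

def masked_decoder(x_hat, W1, b1, W2, b2, S):
    """Evaluate de(x_hat) = (S * W2) y1 + b2 with y1 = s(W1 x_hat + b1)."""
    z1 = W1 @ x_hat + b1          # dense hidden layer
    y1 = sigmoid(z1)
    return (S * W2) @ y1 + b2     # element-wise mask removes connections

# Usage with random (untrained) parameters
rng = np.random.default_rng(0)
n_latent, n_hidden, n_out = 3, 12, 8
S = stencil_mask(n_out, n_hidden)
W1 = rng.standard_normal((n_hidden, n_latent))
b1 = rng.standard_normal(n_hidden)
W2 = rng.standard_normal((n_out, n_hidden))
b2 = rng.standard_normal(n_out)
x_rec = masked_decoder(rng.standard_normal(n_latent), W1, b1, W2, b2, S)
print(x_rec.shape)
```
</preformat>
<p>Because the mask multiplies the weights element-wise, entries of the dense output-layer matrix outside the sparsity pattern have no influence on the decoder output; each output component depends only on the few hidden units its row of the mask admits.</p>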
</sec>
<sec id="s12_4_4"><label>12.4.4</label>
<title>Hyper-reduction</title>
<p>Irrespective of how small the dimensionality of the ROM&#x2019;s solution subspace <inline-formula id="ieqn-2545"><mml:math id="mml-ieqn-2545"><mml:mi>&#x1D4AE;</mml:mi></mml:math></inline-formula> is, we cannot expect a reduction in computational effort for nonlinear problems. Both the time-continuous NM-Galerkin ROM in Eq. (<xref ref-type="disp-formula" rid="eqn-454">454</xref>) and the time-discrete NM-LSPG ROM in Eq. (<xref ref-type="disp-formula" rid="eqn-458">458</xref>) require repeated evaluations of the nonlinear function <inline-formula id="ieqn-2546"><mml:math id="mml-ieqn-2546"><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula> and, in the case of implicit time-integration schemes, also of its Jacobian <inline-formula id="ieqn-2547"><mml:math id="mml-ieqn-2547"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula>, whose computational complexity is determined by the FOM&#x2019;s dimensionality. The term <italic>hyper-reduction</italic> subsumes techniques in model-order reduction that avoid evaluations whose cost scales with the FOM&#x2019;s size. Among hyper-reduction techniques, the <italic>Discrete Empirical Interpolation Method</italic> (DEIM) [<xref ref-type="bibr" rid="ref-310">310</xref>] and <italic>Gauss-Newton with Approximated Tensors</italic> (GNAT) [<xref ref-type="bibr" rid="ref-311">311</xref>] have gained significant attention in recent years.</p>
<p>In [<xref ref-type="bibr" rid="ref-47">47</xref>], a variant of GNAT that relies on solution snapshots for the approximation of the nonlinear residual term <inline-formula id="ieqn-2548"><mml:math id="mml-ieqn-2548"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> was used; the GNAT acronym was therefore extended with &#x201C;SNS&#x201D; for <italic>solution-based subspace</italic>, i.e., <italic>GNAT-SNS</italic>, see [<xref ref-type="bibr" rid="ref-312">312</xref>]. The idea of DEIM, GNAT and their SNS variants takes up the leitmotif of projection-based methods: the full-order residual <inline-formula id="ieqn-2549"><mml:math id="mml-ieqn-2549"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> is, in turn, approximated linearly by means of a low-dimensional vector <inline-formula id="ieqn-2550"><mml:math id="mml-ieqn-2550"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>, which holds the coefficients of an appropriate set of basis vectors <inline-formula id="ieqn-2551"><mml:math id="mml-ieqn-2551"><mml:msub><mml:mi mathvariant="bold-italic">&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-2552"><mml:math id="mml-ieqn-2552"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-467"><label>(467)</label><mml:math id="mml-eqn-467" display="block"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x2248;</mml:mo><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>r</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x226A;</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula></p>
<p>In DEIM methods, the residual <inline-formula id="ieqn-2553"><mml:math id="mml-ieqn-2553"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> of the time-continuous problem in Eq. (<xref ref-type="disp-formula" rid="eqn-451">451</xref>) is approximated, whereas <inline-formula id="ieqn-2554"><mml:math id="mml-ieqn-2554"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> is to be replaced by <inline-formula id="ieqn-2555"><mml:math id="mml-ieqn-2555"><mml:msubsup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>&#x212C;</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msubsup></mml:math></inline-formula> in the above relation when applying GNAT, which builds upon Petrov-Galerkin ROMs as in Eq. (<xref ref-type="disp-formula" rid="eqn-459">459</xref>). Both DEIM and GNAT variants use <italic>gappy POD</italic> to determine the basis <inline-formula id="ieqn-2556"><mml:math id="mml-ieqn-2556"><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula> for the interpolation of the nonlinear residual. Gappy POD originates in a method for image reconstruction proposed in [<xref ref-type="bibr" rid="ref-313">313</xref>] under the name of a Karhunen-Lo&#x00E8;ve procedure, in which images were reconstructed from individual pixels, i.e., from <italic>gappy data</italic>. In the present context of MOR, gappy POD aims at reconstructing the full-order residual <inline-formula id="ieqn-2557"><mml:math id="mml-ieqn-2557"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> from a small, <inline-formula id="ieqn-2558"><mml:math id="mml-ieqn-2558"><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula>-dimensional subset of its components.</p>
<p>The matrix <inline-formula id="ieqn-2559"><mml:math id="mml-ieqn-2559"><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula> was computed by a singular value decomposition (SVD) of snapshot data. Unlike the original DEIM and GNAT methods, their SNS variants did not use snapshots of the nonlinear residuals <inline-formula id="ieqn-2560"><mml:math id="mml-ieqn-2560"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> and <inline-formula id="ieqn-2561"><mml:math id="mml-ieqn-2561"><mml:msubsup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>&#x212C;</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msubsup></mml:math></inline-formula>, respectively; instead, SVD was performed on the solution snapshots <inline-formula id="ieqn-2562"><mml:math id="mml-ieqn-2562"><mml:mi mathvariant="bold-italic">X</mml:mi></mml:math></inline-formula>. The use of solution snapshots was motivated by the fact that the span of snapshots of the nonlinear term <inline-formula id="ieqn-2563"><mml:math id="mml-ieqn-2563"><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula> was included in the span of solution snapshots for conventional time-integration schemes, see Eq. (<xref ref-type="disp-formula" rid="eqn-456">456</xref>). The vector of <italic>&#x201C;generalized coordinates of the nonlinear residual term&#x201D;</italic> minimizes the square of the Euclidean distance between selected components of the full-order residual <inline-formula id="ieqn-2564"><mml:math id="mml-ieqn-2564"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> and the respective components of its reconstruction:</p>
<p><disp-formula id="eqn-468"><label>(468)</label><mml:math id="mml-eqn-468" display="block"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mstyle><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>argmin</mml:mtext></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mtext>&#x00A0;</mml:mtext><mml:msubsup><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">Z</mml:mi></mml:mstyle><mml:mi>T</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mi>e</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>e</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:mo>&#x226A;</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula></p>
<p>The matrix <inline-formula id="ieqn-2565"><mml:math id="mml-ieqn-2565"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>, which is referred to as the <italic>sampling matrix</italic>, extracts a set of components from the full-order vectors <inline-formula id="ieqn-2566"><mml:math id="mml-ieqn-2566"><mml:mi mathvariant="bold-italic">v</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>. For this purpose, the sampling matrix is built from unit vectors <inline-formula id="ieqn-2567"><mml:math id="mml-ieqn-2567"><mml:msub><mml:mi mathvariant="bold-italic">e</mml:mi><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, which have the value one at the <inline-formula id="ieqn-2568"><mml:math id="mml-ieqn-2568"><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>-th component, i.e., the component of <inline-formula id="ieqn-2569"><mml:math id="mml-ieqn-2569"><mml:mi mathvariant="bold-italic">v</mml:mi></mml:math></inline-formula> to be selected (&#x2018;sampled&#x2019;), and the value zero otherwise. 
The components selected by <inline-formula id="ieqn-2570"><mml:math id="mml-ieqn-2570"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> are equivalently described by the (ordered) set of <italic>sampling indices</italic> <inline-formula id="ieqn-2571"><mml:math id="mml-ieqn-2571"><mml:mi>&#x2110;</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, which are represented by the vector <inline-formula id="ieqn-2572"><mml:math id="mml-ieqn-2572"><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow></mml:msub><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>. 
As the number of components of <inline-formula id="ieqn-2573"><mml:math id="mml-ieqn-2573"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> used for its reconstruction may be larger than the dimensionality of the reduced-order vector <inline-formula id="ieqn-2574"><mml:math id="mml-ieqn-2574"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-2575"><mml:math id="mml-ieqn-2575"><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:mo>&#x2265;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula>, Eq. (<xref ref-type="disp-formula" rid="eqn-468">468</xref>) generally constitutes a least-squares problem, the solution of which follows as</p>
<p><disp-formula id="eqn-469"><label>(469)</label><mml:math id="mml-eqn-469" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mstyle><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Substituting the above result in Eq. (<xref ref-type="disp-formula" rid="eqn-467">467</xref>), the FOM&#x2019;s residual can be interpolated using an <italic>oblique projection matrix</italic> <inline-formula id="ieqn-2576"><mml:math id="mml-ieqn-2576"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AB;</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> as</p>
<p><disp-formula id="eqn-470"><label>(470)</label><mml:math id="mml-eqn-470" display="block"><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x2248;</mml:mo><mml:mi mathvariant="bold-italic">&#x1D4AB;</mml:mi><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AB;</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi mathvariant="double-struck">R</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The above representation in terms of the projection matrix <inline-formula id="ieqn-2577"><mml:math id="mml-ieqn-2577"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AB;</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> somewhat hides the main point of hyper-reduction. In fact, we do not apply <inline-formula id="ieqn-2578"><mml:math id="mml-ieqn-2578"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AB;</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> to the full-order residual <inline-formula id="ieqn-2579"><mml:math id="mml-ieqn-2579"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula>, which would be tautological. Unrolling the definition of <inline-formula id="ieqn-2580"><mml:math id="mml-ieqn-2580"><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x1D4AB;</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula>, we note that <inline-formula id="ieqn-2581"><mml:math id="mml-ieqn-2581"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> is a vector containing only a small subset of components of the full-order residual <inline-formula id="ieqn-2582"><mml:math id="mml-ieqn-2582"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula>. In other words, to evaluate the approximation in Eq. 
(<xref ref-type="disp-formula" rid="eqn-470">470</xref>), only <inline-formula id="ieqn-2583"><mml:math id="mml-ieqn-2583"><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:mo>&#x226A;</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> components of <inline-formula id="ieqn-2584"><mml:math id="mml-ieqn-2584"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> need to be computed, i.e., the computational cost no longer scales with the FOM&#x2019;s dimensionality.</p>
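<p>The gappy reconstruction of Eqs. (<xref ref-type="disp-formula" rid="eqn-467">467</xref>) and (<xref ref-type="disp-formula" rid="eqn-469">469</xref>) may be illustrated by the following NumPy sketch. It is an illustration under stated assumptions rather than a reference implementation: the function name <monospace>gappy_reconstruct</monospace>, the synthetic low-rank snapshot matrix, and the randomly drawn sampling indices are hypothetical choices made here. The sketch only ever touches the sampled rows of the basis and the sampled entries of the residual; the full projection matrix is never formed.</p>
<preformat>
```python
import numpy as np

def gappy_reconstruct(Phi_r, idx, r_sampled):
    """Reconstruct a full-order vector from a few sampled components.

    Phi_r     : (N_s, n_r) basis of the residual subspace
    idx       : n_z sampling indices (the rows picked by Z^T)
    r_sampled : (n_z,) sampled components, i.e. Z^T r
    """
    ZT_Phi = Phi_r[idx, :]                    # Z^T Phi_r by row selection
    # Least-squares solution, equal to pinv(Z^T Phi_r) @ (Z^T r)
    r_hat, *_ = np.linalg.lstsq(ZT_Phi, r_sampled, rcond=None)
    return Phi_r @ r_hat, r_hat               # reconstruction and coordinates

# Synthetic data (assumption): snapshots of exact rank n_r
rng = np.random.default_rng(0)
N_s, n_r, n_z = 200, 4, 8
snapshots = rng.standard_normal((N_s, n_r)) @ rng.standard_normal((n_r, 30))
U, _, _ = np.linalg.svd(snapshots, full_matrices=False)
Phi_r = U[:, :n_r]

# Random sampling indices, for illustration only (no greedy selection here)
idx = rng.choice(N_s, size=n_z, replace=False)

r_full = Phi_r @ rng.standard_normal(n_r)     # a vector inside span(Phi_r)
r_rec, r_hat = gappy_reconstruct(Phi_r, idx, r_full[idx])
print(np.max(np.abs(r_rec - r_full)))         # reconstruction error
```
</preformat>
<p>For a vector that lies in the span of the basis, the reconstruction from the sampled components is exact up to round-off, provided that the sampled rows of the basis have full column rank. For a general residual, the result is the least-squares fit on the sampled components.</p>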
<p>Several methods have been proposed to efficiently construct a suitable set of sampling indices <inline-formula id="ieqn-2585"><mml:math id="mml-ieqn-2585"><mml:mi>&#x2110;</mml:mi></mml:math></inline-formula>, see, e.g., [<xref ref-type="bibr" rid="ref-310">310</xref>], [<xref ref-type="bibr" rid="ref-311">311</xref>], [<xref ref-type="bibr" rid="ref-314">314</xref>]. These methods share the property of being <italic>greedy algorithms</italic>, i.e., algorithms that build the set of sampling indices sequentially (&#x201C;inductively&#x201D;), choosing at each step the index that is optimal with respect to some suitable metric. For instance, the authors of [<xref ref-type="bibr" rid="ref-310">310</xref>] selected the <inline-formula id="ieqn-2586"><mml:math id="mml-ieqn-2586"><mml:mi>j</mml:mi></mml:math></inline-formula>-th index as the component at which the gappy reconstruction of the <inline-formula id="ieqn-2587"><mml:math id="mml-ieqn-2587"><mml:mi>j</mml:mi></mml:math></inline-formula>-th POD-mode, <inline-formula id="ieqn-2588"><mml:math id="mml-ieqn-2588"><mml:msub><mml:mover accent='true'><mml:mi>&#x03D5;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, showed the largest error compared to the original mode <inline-formula id="ieqn-2589"><mml:math id="mml-ieqn-2589"><mml:msub><mml:mi mathvariant="bold-italic">&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. For the reconstruction, the first <inline-formula id="ieqn-2590"><mml:math id="mml-ieqn-2590"><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> POD-modes were used, i.e.,</p>
<p><disp-formula id="eqn-471"><label>(471)</label><mml:math id="mml-eqn-471" display="block"><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2248;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mover accent='true'><mml:mi>&#x03D5;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The reduced-order vector <inline-formula id="ieqn-2591"><mml:math id="mml-ieqn-2591"><mml:msub><mml:mover accent='true'><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> minimizes the square of the Euclidean distance between <inline-formula id="ieqn-2592"><mml:math id="mml-ieqn-2592"><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> components (selected by <inline-formula id="ieqn-2593"><mml:math id="mml-ieqn-2593"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>) of the <inline-formula id="ieqn-2594"><mml:math id="mml-ieqn-2594"><mml:mi>j</mml:mi></mml:math></inline-formula>-th mode <inline-formula id="ieqn-2595"><mml:math id="mml-ieqn-2595"><mml:msub><mml:mi mathvariant="bold-italic">&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the corresponding components of its gappy reconstruction <inline-formula id="ieqn-2596"><mml:math id="mml-ieqn-2596"><mml:msub><mml:mover><mml:mi>&#x03D5;</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-472"><label>(472)</label><mml:math id="mml-eqn-472" display="block"><mml:msub><mml:mover accent='true'><mml:mi>&#x03D5;</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>argmin</mml:mtext></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:msubsup><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>The key idea of the greedy algorithm is to select additional indices to minimize the error of the gappy reconstruction. Therefore, the component of the reconstructed mode that differs most (in terms of magnitude) from the original mode defines the <inline-formula id="ieqn-2597"><mml:math id="mml-ieqn-2597"><mml:mi>j</mml:mi></mml:math></inline-formula>-th sampling index:</p>
<p><disp-formula id="eqn-473"><label>(473)</label><mml:math id="mml-eqn-473" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>p</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mrow><mml:mtext>arg max</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mi mathvariant="normal">&#x2216;</mml:mi><mml:mrow><mml:mi>&#x02110;</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold-italic">&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>i</mml:mi></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>A pseudocode representation of the greedy approach for selecting sampling indices is given in Algorithm <xref ref-type="fig" rid="fig-166">8</xref>. To start the algorithm, the first sampling index is chosen according to the largest (in terms of magnitude) component of the first POD-mode, i.e., <inline-formula id="ieqn-2598"><mml:math id="mml-ieqn-2598"><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mrow></mml:munder><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">|</mml:mo></mml:math></inline-formula>.</p> 
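<p>As a rough illustration of the greedy selection in Eqs. (<xref ref-type="disp-formula" rid="eqn-472">472</xref>) and (<xref ref-type="disp-formula" rid="eqn-473">473</xref>), the following Python sketch picks one sampling index per mode. It is not taken from [<xref ref-type="bibr" rid="ref-47">47</xref>] and is not a transcription of Algorithm 8: the function name greedy_sampling_indices, the use of NumPy, the 0-based indexing, and the restriction to exactly one index per mode are assumptions made here for brevity.</p>
<code language="python">
```python
import numpy as np


def greedy_sampling_indices(Phi_r):
    """Greedy choice of one sampling index per column (mode) of Phi_r.

    Phi_r: array of shape (N_s, n_modes) holding the residual POD modes.
    Returns a list of n_modes distinct, 0-based row indices.
    """
    n_rows, n_modes = Phi_r.shape
    # First index: largest-magnitude component of the first mode.
    indices = [int(np.argmax(np.abs(Phi_r[:, 0])))]
    for j in range(1, n_modes):
        basis = Phi_r[:, :j]   # modes processed so far
        mode = Phi_r[:, j]     # mode to be reconstructed
        # Least-squares fit using the sampled rows only, cf. Eq. (472).
        coeffs, *_ = np.linalg.lstsq(basis[indices, :], mode[indices], rcond=None)
        # Reconstruction error on all rows, cf. Eq. (473).
        error = np.abs(mode - basis @ coeffs)
        # Exclude rows that were already selected.
        error[indices] = -1.0
        indices.append(int(np.argmax(error)))
    return indices
```
</code>
<p>For a basis with linearly independent columns, e.g., the orthogonal factor of a QR decomposition of a random matrix, the function returns as many distinct row indices as there are modes.</p>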
<p>The authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] replaced the residual vector <inline-formula id="ieqn-2605"><mml:math id="mml-ieqn-2605"><mml:mrow><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> with its gappy reconstruction <inline-formula id="ieqn-2606"><mml:math id="mml-ieqn-2606"><mml:mrow><mml:mi>&#x1D4AB;</mml:mi><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> in the minimization problem in Eq. (<xref ref-type="disp-formula" rid="eqn-452">452</xref>). Unlike in the original Galerkin ROM in Eq. (<xref ref-type="disp-formula" rid="eqn-452">452</xref>), the rate of the reduced vector of generalized coordinates minimizes not the (square of the) FOM&#x2019;s residual <inline-formula id="ieqn-2607"><mml:math id="mml-ieqn-2607"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula>, but the corresponding reduced residual <inline-formula id="ieqn-2608"><mml:math id="mml-ieqn-2608"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>; the two are related by <inline-formula id="ieqn-2609"><mml:math id="mml-ieqn-2609"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>r</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-474"><label>(474)</label><mml:math id="mml-eqn-474" display="block"><mml:mrow><mml:mover><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>arg&#x00A0;min</mml:mtext></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:msubsup><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>r</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>arg&#x00A0;min</mml:mtext></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo 
stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:msubsup><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>T</mml:mi></mml:msup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent='true'><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='true'>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x00A0;</mml:mtext></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>From the above minimization problem, the ROM&#x2019;s ODEs were determined in [<xref ref-type="bibr" rid="ref-47">47</xref>] by taking the derivative of the objective function with respect to the reduced vector of generalized velocities <inline-formula id="ieqn-2610"><mml:math id="mml-ieqn-2610"><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>, evaluating it at <inline-formula id="ieqn-2611"><mml:math id="mml-ieqn-2611"><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mover><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:math></inline-formula>, and setting it to zero, i.e.,<xref ref-type="fn" rid="fn291"><sup>291</sup></xref><fn id="fn291"><label>291</label><p>Note again that the step-by-step derivation of the ODEs in Eq. (<xref ref-type="disp-formula" rid="eqn-477">477</xref>) was not given by the authors of [<xref ref-type="bibr" rid="ref-47">47</xref>], which is why we provide it in our review paper for the sake of clarity.</p></fn></p>
<p><disp-formula id="eqn-475"><label>(475)</label><mml:math id="mml-eqn-475" display="block"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mn>2</mml:mn><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi 
mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold-italic">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold-italic">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow></mml:mfrac><mml:mtext>&#x00A0;</mml:mtext></mml:mrow> <mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mspace width="5em" /><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold-italic">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mover accent='true'><mml:mover 
accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold-italic">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mspace width="5em" /><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>2</mml:mn><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold-italic">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msup><mml:mrow> <mml:mrow><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow> 
<mml:mo>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo>(</mml:mo> <mml:mrow><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>&#x2212;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-166">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-166.tif"/>
</fig>
<p>where the definition of the residual vector <inline-formula id="ieqn-2612"><mml:math id="mml-ieqn-2612"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> has been introduced in Eq. (<xref ref-type="disp-formula" rid="eqn-451">451</xref>). We therefore obtain the following linear system of equations for <inline-formula id="ieqn-2613"><mml:math id="mml-ieqn-2613"><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-476"><label>(476)</label><mml:math id="mml-eqn-476" display="block"><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mtext>&#x00A0;</mml:mtext><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi 
mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>which is solved for <inline-formula id="ieqn-2614"><mml:math id="mml-ieqn-2614"><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:math></inline-formula> by means of the pseudo-inverse. Complemented by proper initial conditions <inline-formula id="ieqn-2615"><mml:math id="mml-ieqn-2615"><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mn>0</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, the solution gives the governing system of ODEs of the <italic>hyper-reduced nonlinear manifold, least-squares-Galerkin</italic> (NM-LS-Galerkin-HR) ROM [<xref ref-type="bibr" rid="ref-47">47</xref>]:</p>
<p><disp-formula id="eqn-477"><label>(477)</label><mml:math id="mml-eqn-477" display="block"><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x2009;&#x2009;&#x2009;&#x2009;&#x00A0;</mml:mtext><mml:mover 
accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mn>0</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula></p>
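<p>Because the sampling matrix merely selects rows, the right-hand side of Eq. (<xref ref-type="disp-formula" rid="eqn-477">477</xref>) can be evaluated from the sampled rows alone. The following Python sketch illustrates this under stated assumptions: the function name hyper_reduced_velocity and its arguments are hypothetical, the decoder Jacobian and the FOM velocity are passed as precomputed NumPy arrays, and the sampling matrix is represented by a list of 0-based row indices rather than by an explicit matrix.</p>
<code language="python">
```python
import numpy as np


def hyper_reduced_velocity(Phi_r, sample_idx, J_g, f_full):
    """Evaluate the right-hand side of Eq. (477) for a single state.

    Phi_r:      (N_s, n_r) residual basis matrix.
    sample_idx: list of sampled row indices (plays the role of Z^T).
    J_g:        (N_s, n_s) decoder Jacobian at the current reduced state.
    f_full:     (N_s,) FOM velocity f(x_ref + g(x_hat), t; mu).
    Only the sampled rows of J_g and f_full are actually read.
    """
    # Pseudo-inverse of the sampled residual basis, (Z^T Phi_r)^dagger.
    P = np.linalg.pinv(Phi_r[sample_idx, :])
    A = P @ J_g[sample_idx, :]   # (Z^T Phi_r)^dagger Z^T J_g
    b = P @ f_full[sample_idx]   # (Z^T Phi_r)^dagger Z^T f
    # Outer pseudo-inverse solves the least-squares problem in Eq. (474).
    return np.linalg.pinv(A) @ b
```
</code>
<p>A simple sanity check: if the FOM velocity lies in the column space of the decoder Jacobian, i.e., it equals the Jacobian times some reduced velocity, and the sampled matrices have full column rank, the function recovers that reduced velocity.</p>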
<statement id="st12_6"><title>Remark 12.6.</title>
<p><italic>Equivalent minimization problems</italic>. Note the subtle difference between the minimization problems that govern the ROMs without and with hyper-reduction, see Eqs. (<xref ref-type="disp-formula" rid="eqn-452">452</xref>) and (<xref ref-type="disp-formula" rid="eqn-474">474</xref>), respectively. In the case without hyper-reduction, Eq. (<xref ref-type="disp-formula" rid="eqn-452">452</xref>), the minimum is sought for the approximate <italic>full-dimensional</italic> residual <inline-formula id="ieqn-2616"><mml:math id="mml-ieqn-2616"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula>. In the hyper-reduced variant, Eq. (<xref ref-type="disp-formula" rid="eqn-474">474</xref>), the authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] instead minimized the projected residual <inline-formula id="ieqn-2617"><mml:math id="mml-ieqn-2617"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>, which is related to its full-dimensional counterpart by the residual basis matrix <inline-formula id="ieqn-2618"><mml:math id="mml-ieqn-2618"><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-2619"><mml:math id="mml-ieqn-2619"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>. Using the full-order residual also in the hyper-reduced ROM translates into the following minimization problem:</p>
<p><disp-formula id="eqn-478"><label>(478)</label><mml:math id="mml-eqn-478" display="block"><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mi>arg</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mi>min</mml:mi></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:msubsup><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>arg&#x00A0;min</mml:mtext></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:msubsup><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi 
mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow> <mml:mo>&#x2016;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>Repeating the steps of the derivation in Eq. (<xref ref-type="disp-formula" rid="eqn-475">475</xref>) then gives</p>
<p><disp-formula id="eqn-479"><label>(479)</label><mml:math id="mml-eqn-479" display="block"><mml:mtable columnalign='left'>
<mml:mtr><mml:mtd><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr><mml:mtd><mml:mspace width="1em" /><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:msubsup><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi><mml:mo>&#x2020;</mml:mo></mml:msubsup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr><mml:mtd><mml:mspace width="1em" /><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>i.e., using the identity <inline-formula id="ieqn-2620"><mml:math id="mml-ieqn-2620"><mml:msubsup><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi><mml:mo>&#x2020;</mml:mo></mml:msubsup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula>, we recover exactly the same result as in the hyper-reduced case in Eq. (<xref ref-type="disp-formula" rid="eqn-477">477</xref>). The only requirement is that the residual basis vectors <inline-formula id="ieqn-2621"><mml:math id="mml-ieqn-2621"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2622"><mml:math id="mml-ieqn-2622"><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula>, be linearly independent.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<statement id="st12_7"><title>Remark 12.7.</title>
<p>Further reduction of the system operator? At first glance, the operator in the ROM&#x2019;s governing equations in Eq. (<xref ref-type="disp-formula" rid="eqn-477">477</xref>) appears to be further reducible:</p>
<p><disp-formula id="eqn-480"><label>(480)</label><mml:math id="mml-eqn-480" display="block"><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>Note that the product <inline-formula id="ieqn-2623"><mml:math id="mml-ieqn-2623"><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2020;</mml:mo></mml:msup></mml:math></inline-formula> generally does not, however, evaluate to identity, since our particular definition of the pseudo inverse of <inline-formula id="ieqn-2624"><mml:math id="mml-ieqn-2624"><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is a <italic>left inverse</italic>, for which <inline-formula id="ieqn-2625"><mml:math id="mml-ieqn-2625"><mml:msup><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:math></inline-formula> holds, but <inline-formula id="ieqn-2626"><mml:math id="mml-ieqn-2626"><mml:mi mathvariant="bold-italic">A</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mo>&#x2260;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">I</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:math></inline-formula>.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement><p>We first consider linear subspace methods, for which the Jacobian <inline-formula id="ieqn-2627"><mml:math id="mml-ieqn-2627"><mml:msub><mml:mi 
mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi mathvariant="bold">&#x03A6;</mml:mi></mml:math></inline-formula> that spans the tangent space of the solution manifold is constant, see Eq. (<xref ref-type="disp-formula" rid="eqn-450">450</xref>). Substituting <inline-formula id="ieqn-2628"><mml:math id="mml-ieqn-2628"><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula> for <inline-formula id="ieqn-2629"><mml:math id="mml-ieqn-2629"><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-2630"><mml:math id="mml-ieqn-2630"><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext></mml:math></inline-formula> for <inline-formula id="ieqn-2631"><mml:math id="mml-ieqn-2631"><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula>, the above set of ODEs in Eq. (<xref ref-type="disp-formula" rid="eqn-477">477</xref>) governing the NM-Galerkin-HR ROM reduces to the corresponding <italic>linear subspace</italic> (LS-Galerkin-HR) ROM:</p>
<p><disp-formula id="eqn-481"><label>(481)</label><mml:math id="mml-eqn-481" display="block"><mml:mover accent='true'><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x02D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext></mml:mrow> <mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>Note that, in linear subspace methods, the operator <inline-formula id="ieqn-2632"><mml:math id="mml-ieqn-2632"><mml:msup><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext></mml:mrow> <mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup></mml:math></inline-formula> is independent of the solution and, hence, <italic>&#x201C;can be pre-computed once for all&#x201D;</italic>. The products <inline-formula id="ieqn-2633"><mml:math id="mml-ieqn-2633"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2634"><mml:math id="mml-ieqn-2634"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold">&#x03A6;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-2635"><mml:math id="mml-ieqn-2635"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula> need not be evaluated explicitly.</p>
<fig id="fig-124">
<label>Figure 124</label>
<caption><title><italic>Subnet construction</italic> (Section <xref ref-type="sec" rid="s12_4_4">12.4.4</xref>). To reduce computational cost, a subnet representing the set of active paths, which comprise all neurons and connections needed for the evaluation of selected outputs (highlighted in orange), i.e., the reduced residual <inline-formula id="ieqn-2011"><mml:math id="mml-ieqn-2011"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>, is constructed (left). The size of the hidden layer of the subnet depends on which output components of the decoder are needed for the reconstruction of the full-order residual. If the full-order residual is reconstructed from successive outputs of the decoder, the number of neurons in the hidden layer involved in the evaluation becomes minimal due to the specific sparsity patterns proposed. Uniformly distributed components show the least overlap in terms of hidden-layer neurons required, which is why the subnet and therefore the computational cost in the hyper-reduction approach are maximal (right). (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-124.tif"/>
</fig>
<p>Instead, only those rows of <inline-formula id="ieqn-2636"><mml:math id="mml-ieqn-2636"><mml:msub><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2637"><mml:math id="mml-ieqn-2637"><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext></mml:math></inline-formula> and <inline-formula id="ieqn-2638"><mml:math id="mml-ieqn-2638"><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula> that are selected by the sampling matrix <inline-formula id="ieqn-2639"><mml:math id="mml-ieqn-2639"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> need to be extracted or computed when evaluating the above operator. In the context of MOR, pre-computations such as that of the above operator and, more importantly, the computationally demanding collection of full-order solution and residual snapshots and the subsequent SVDs are attributed to the <italic>offline phase</italic> or <italic>stage</italic>, see, e.g., [<xref ref-type="bibr" rid="ref-307">307</xref>]. Ideally, the <italic>online phase</italic> only requires evaluations of quantities that scale with the dimensionality of the ROM.</p>
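<p>The split into offline and online work for the LS-Galerkin-HR ROM in Eq. (<xref ref-type="disp-formula" rid="eqn-481">481</xref>) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of any cited work: the bases are random orthonormal stand-ins, all dimensions are hypothetical, and the names <monospace>operator</monospace>, <monospace>rom_rhs</monospace> and <monospace>sampled_f</monospace> are ours. The sampling matrix is never formed; the product of its transpose with a matrix is obtained by row indexing.</p>
<code language="python" xml:space="preserve">
```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: full order, ROM, residual basis, number of samples.
n, n_s, n_r, n_z = 200, 5, 12, 20

# Random orthonormal stand-ins for the solution basis Phi and residual basis Phi_r.
Phi = np.linalg.qr(rng.standard_normal((n, n_s)))[0]
Phi_r = np.linalg.qr(rng.standard_normal((n, n_r)))[0]

# Sampling indices p_i; Z^T A is simply A[idx], so Z itself is never built.
idx = np.sort(rng.choice(n, size=n_z, replace=False))

# Offline: {(Z^T Phi_r)^+ Z^T Phi}^+ (Z^T Phi_r)^+, independent of the solution.
ZtPhi_r_pinv = np.linalg.pinv(Phi_r[idx])                           # (n_r, n_z)
operator = np.linalg.pinv(ZtPhi_r_pinv @ Phi[idx]) @ ZtPhi_r_pinv   # (n_s, n_z)


def rom_rhs(x_hat, sampled_f):
    """Online: time derivative of x_hat from the sampled entries Z^T f only."""
    return operator @ sampled_f(x_hat)


# Toy nonlinear term replaced by linear decay f(x) = -x, evaluated on sampled rows.
x_hat = np.arange(1.0, n_s + 1.0)
x_hat_dot = rom_rhs(x_hat, lambda xh: -(Phi[idx] @ xh))
```
</code>
<p>For the linear toy term chosen here, the hyper-reduced right-hand side reproduces the exact reduced dynamics, which makes the sketch easy to check; for a genuinely nonlinear term, only the callable passed as <monospace>sampled_f</monospace> would change.</p>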
<p>By keeping track of which components of the full-order solution <inline-formula id="ieqn-2640"><mml:math id="mml-ieqn-2640"><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> are involved in the computation of selected components of the nonlinear term, i.e., <inline-formula id="ieqn-2641"><mml:math id="mml-ieqn-2641"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula>, the computational cost can be reduced even further. In other words, we need not reconstruct all components of the full-order solution <inline-formula id="ieqn-2642"><mml:math id="mml-ieqn-2642"><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula>, but only those components that are required for the evaluation of <inline-formula id="ieqn-2643"><mml:math id="mml-ieqn-2643"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula>. 
However, the number of components of <inline-formula id="ieqn-2644"><mml:math id="mml-ieqn-2644"><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> that is needed for this purpose, which translates into the number of products of rows in <inline-formula id="ieqn-2645"><mml:math id="mml-ieqn-2645"><mml:mtext mathvariant="bold">&#x03A6;</mml:mtext></mml:math></inline-formula> with the reduced-order solution <inline-formula id="ieqn-2646"><mml:math id="mml-ieqn-2646"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>, is typically much larger than the number of sampling indices <inline-formula id="ieqn-2647"><mml:math id="mml-ieqn-2647"><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, i.e., the cardinality of the set <inline-formula id="ieqn-2648"><mml:math id="mml-ieqn-2648"><mml:mi>&#x2110;</mml:mi></mml:math></inline-formula>.</p>
<p>To explain this discrepancy, assume that the full-order model is obtained from a finite-element discretization. Given some particular nodal point, all finite elements sharing the node contribute to the corresponding components of the nonlinear function <inline-formula id="ieqn-2649"><mml:math id="mml-ieqn-2649"><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula>. So we generally must evaluate several elements when computing a single component of <inline-formula id="ieqn-2650"><mml:math id="mml-ieqn-2650"><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula> corresponding to a single sampling index <inline-formula id="ieqn-2651"><mml:math id="mml-ieqn-2651"><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x02110;</mml:mi></mml:mrow></mml:math></inline-formula>, which, in turn, involves coordinates of all elements associated with the <inline-formula id="ieqn-2652"><mml:math id="mml-ieqn-2652"><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>-th degree of freedom.<xref ref-type="fn" rid="fn292"><sup>292</sup></xref><fn id="fn292"><label>292</label><p>The &#x201C;Unassembled DEIM&#x201D; (UDEIM) method proposed in [<xref ref-type="bibr" rid="ref-315">315</xref>] provides a partial remedy for that issue in the context of finite-element problems. In UDEIM, the algorithm is applied to the unassembled residual vector, i.e., the set of element residuals, which restricts the dependency among generalized coordinates to individual elements.</p></fn></p>
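<p>The gap between the number of sampling indices and the number of full-order components that must be reconstructed can be illustrated on a one-dimensional mesh of two-node elements. The sketch below is a toy example of our own: the mesh, the basis and the function name <monospace>required_dofs</monospace> are hypothetical, and each node is assumed to carry a single degree of freedom.</p>
<code language="python" xml:space="preserve">
```python
import numpy as np


def required_dofs(sample_idx, elements):
    """Sorted DOFs of all elements that share at least one sampled DOF."""
    elements = np.asarray(elements)
    touched = np.isin(elements, sample_idx).any(axis=1)
    return np.unique(elements[touched])


# Hypothetical 1D mesh: element e connects nodes e and e + 1.
n_nodes = 11
elements = np.column_stack([np.arange(n_nodes - 1), np.arange(1, n_nodes)])

# Two sampling indices p_i ...
sample_idx = np.array([2, 7])
# ... require the state at their neighbours as well.
needed = required_dofs(sample_idx, elements)

# Only the needed rows of the (stand-in) basis enter the reconstruction.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((n_nodes, 3))
x_ref = np.zeros(n_nodes)
x_hat = rng.standard_normal(3)
x_needed = x_ref[needed] + Phi[needed] @ x_hat
```
</code>
<p>Here two sampled degrees of freedom already require six components of the full-order state; in two or three spatial dimensions, where more elements share a node, the ratio grows accordingly.</p>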
<p>For nonlinear manifold methods, we cannot expect much improvement in computational efficiency from hyper-reduction. As a matter of fact, the &#x2018;nonlinearity&#x2019; becomes twofold if the reduced subspace is a nonlinear manifold: not only do we have to compute selected components of the nonlinear term <inline-formula id="ieqn-2653"><mml:math id="mml-ieqn-2653"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">f</mml:mi></mml:math></inline-formula>, but we also need to evaluate the nonlinear manifold <inline-formula id="ieqn-2654"><mml:math id="mml-ieqn-2654"><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula>. More importantly from a computational point of view, the relevant rows of the Jacobian of the nonlinear manifold, which are extracted by <inline-formula id="ieqn-2655"><mml:math id="mml-ieqn-2655"><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold-italic">J</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, must be re-computed for every update of the (reduced-order) solution <inline-formula id="ieqn-2656"><mml:math id="mml-ieqn-2656"><mml:mover accent='true'><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>, see Eq. (<xref ref-type="disp-formula" rid="eqn-477">477</xref>).</p>
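<p>Why the sampled Jacobian rows cannot be pre-computed in the nonlinear-manifold case can be seen from a small sketch. We assume, purely for illustration, a dense single-hidden-layer decoder with tanh activation and random weights; the names <monospace>decoder_rows</monospace> and <monospace>jacobian_rows</monospace> and all sizes are hypothetical and do not reproduce the network of any cited work.</p>
<code language="python" xml:space="preserve">
```python
import numpy as np

rng = np.random.default_rng(1)
n, n_h, n_s = 50, 16, 3          # hypothetical output, hidden and latent sizes
W1, b1 = rng.standard_normal((n_h, n_s)), rng.standard_normal(n_h)
W2, b2 = rng.standard_normal((n, n_h)), rng.standard_normal(n)


def decoder_rows(x_hat, idx):
    """Selected outputs Z^T g(x_hat) of the stand-in decoder g."""
    return W2[idx] @ np.tanh(W1 @ x_hat + b1) + b2[idx]


def jacobian_rows(x_hat, idx):
    """Selected rows Z^T J_g(x_hat); the activation slope depends on x_hat."""
    slope = 1.0 - np.tanh(W1 @ x_hat + b1) ** 2   # derivative of tanh
    return (W2[idx] * slope) @ W1                 # shape (len(idx), n_s)


idx = np.array([0, 7, 42])                        # sampling indices p_i
x_a = np.array([0.1, -0.2, 0.3])
x_b = x_a + 1.0
J_a = jacobian_rows(x_a, idx)                     # differs from J_b in general
J_b = jacobian_rows(x_b, idx)
```
</code>
<p>In the linear-subspace case the corresponding rows would be constant rows of the basis matrix; here they change with every update of the reduced-order solution and have to be re-evaluated online.</p>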
<p>For Petrov-Galerkin-type variants of ROMs, hyper-reduction works in exactly the same way as with their Galerkin counterparts. The residual in the minimization problem in Eq. (<xref ref-type="disp-formula" rid="eqn-458">458</xref>) is approximated by a gappy reconstruction, i.e.,</p>
<p><disp-formula id="eqn-482"><label>(482)</label><mml:math id="mml-eqn-482" display="block"><mml:msub><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>arg&#x00A0;min</mml:mtext></mml:mrow><mml:mrow><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mtext>&#x2009;</mml:mtext><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mtext>&#x2009;&#x2009;</mml:mtext><mml:msubsup><mml:mrow><mml:mo>&#x2016;</mml:mo> <mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi mathvariant="bold">&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x2020;</mml:mo></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">Z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msubsup><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mrow><mml:mi>B</mml:mi><mml:mi>E</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mi>v</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">&#x00B5;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x2016;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>From a computational point of view, the same implications apply to Petrov-Galerkin ROMs as to Galerkin-type ROMs, which is why we focus on the latter in our review.</p>
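<p>The gappy reconstruction that enters the objective in Eq. (<xref ref-type="disp-formula" rid="eqn-482">482</xref>) can be sketched in a few lines. The example is a minimal illustration under our own assumptions: the residual basis is a random orthonormal stand-in, the sizes are hypothetical, and the test residual is deliberately chosen inside the span of the basis so that the reconstruction is exact.</p>
<code language="python" xml:space="preserve">
```python
import numpy as np

rng = np.random.default_rng(2)
n, n_r, n_z = 100, 6, 10         # hypothetical sizes
Phi_r = np.linalg.qr(rng.standard_normal((n, n_r)))[0]
idx = np.sort(rng.choice(n, size=n_z, replace=False))


def reduced_residual(r_sampled):
    """r_hat = (Z^T Phi_r)^+ Z^T r, solved as a small least-squares problem."""
    return np.linalg.lstsq(Phi_r[idx], r_sampled, rcond=None)[0]


# Stand-in full-order residual lying in the span of the residual basis.
r_full = Phi_r @ rng.standard_normal(n_r)

r_hat = reduced_residual(r_full[idx])   # uses n_z sampled entries only
r_gappy = Phi_r @ r_hat                 # reconstruction, formed here only for checking
objective = 0.5 * r_hat @ r_hat         # value of the hyper-reduced objective
```
</code>
<p>Within the minimization, only <monospace>r_hat</monospace> is needed; the full-length vector <monospace>r_gappy</monospace> is formed here merely to show that the sampled entries suffice when the residual is well represented by the basis.</p>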
<p>In the approach of [<xref ref-type="bibr" rid="ref-47">47</xref>], the nonlinear manifold <inline-formula id="ieqn-2657"><mml:math id="mml-ieqn-2657"><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula> was represented by a feed-forward neural network, i.e., essentially the decoder of the proposed sparse autoencoder, see Eq. (<xref ref-type="disp-formula" rid="eqn-463">463</xref>).</p>
<p>The computational cost of evaluating the decoder and its Jacobian scales with the number of parameters of the neural network. Both the shallowness and the sparsity of the decoder network already contribute to computational efficiency with regard to the number of parameters.</p>
<p>Additionally, the authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] traced &#x201C;active paths&#x201D; when evaluating selected components of the decoder and its Jacobian in the hyper-reduced model. The set of active paths comprises all those connections and neurons of the decoder network which are involved in evaluations of its outputs. Figure <xref ref-type="fig" rid="fig-124">124</xref> (left) highlights in orange the active paths for the computation of the components of the reduced residual <inline-formula id="ieqn-2658"><mml:math id="mml-ieqn-2658"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:math></inline-formula>, from which the full residual vector <inline-formula id="ieqn-2659"><mml:math id="mml-ieqn-2659"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> is reconstructed within the hyper-reduction method.</p>
<p>Given all active paths, a subnet of the decoder network is constructed to only evaluate those components of the full-order state which are required to compute the hyper-reduced residual. The computational cost of computing the residual and its Jacobian depends on the size of the subnet. As both input and output dimension are given, size translates into width of the (single) hidden layer. The size of the hidden layer, in turn, depends on the distribution of the sampling indices <inline-formula id="ieqn-2660"><mml:math id="mml-ieqn-2660"><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, i.e., from which components the full residual <inline-formula id="ieqn-2661"><mml:math id="mml-ieqn-2661"><mml:mover accent='true'><mml:mi>r</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> is reconstructed.</p>
<p>For the sparsity patterns assumed, successive output components show the largest overlap in terms of the number of neurons in the hidden layer involved in the evaluation, whereas the overlap is minimal in case of equally spaced outputs.</p>
<p>The cases of successive and equally distributed sampling indices constitute extremal cases, for which the computational time for the evaluation of both the residual and its Jacobian of the 2D example (Section <xref ref-type="sec" rid="s12_4_5">12.4.5</xref>) is illustrated as a function of the dimensionality of the reduced residual (&#x201C;number of sampling points&#x201D;) in Figure <xref ref-type="fig" rid="fig-124">124</xref> (right).</p>
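<p>The dependence of the subnet size on the distribution of the sampling indices can be reproduced with a toy model. The sketch below assumes a hypothetical banded sparsity mask on the output layer, in which each output is connected only to a few neighbouring hidden neurons; the mask, the sizes and the function names are our own and only mimic, rather than reproduce, the sparsity patterns of [<xref ref-type="bibr" rid="ref-47">47</xref>].</p>
<code language="python" xml:space="preserve">
```python
import numpy as np


def band_mask(n_out, n_hidden, half_width):
    """Hypothetical banded pattern: output i sees hidden neurons near its centre."""
    centres = np.rint(np.linspace(0, n_hidden - 1, n_out)).astype(int)
    cols = np.arange(n_hidden)
    return np.less_equal(np.abs(cols[None, :] - centres[:, None]), half_width)


def active_hidden(mask, idx):
    """Hidden neurons lying on the active paths of the selected outputs."""
    return np.flatnonzero(mask[idx].any(axis=0))


n_out, n_hidden, n_s = 1000, 200, 5
mask = band_mask(n_out, n_hidden, half_width=3)

successive = np.arange(400, 420)                    # 20 neighbouring outputs
spread = np.linspace(0, n_out - 1, 20).astype(int)  # 20 equally spaced outputs
n_succ = active_hidden(mask, successive).size
n_spread = active_hidden(mask, spread).size

# Subnet evaluation: only active hidden neurons and selected outputs are used.
rng = np.random.default_rng(3)
W1, b1 = rng.standard_normal((n_hidden, n_s)), rng.standard_normal(n_hidden)
W2 = rng.standard_normal((n_out, n_hidden)) * mask  # masked output layer


def subnet_outputs(x_hat, idx):
    act = active_hidden(mask, idx)
    hidden = np.tanh(W1[act] @ x_hat + b1[act])
    return W2[np.ix_(idx, act)] @ hidden
```
</code>
<p>With this mask, the successive indices activate far fewer hidden neurons than the equally spaced ones, in line with the two extremal cases described above, while the subnet returns the same selected outputs as the full masked network.</p>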
<fig id="fig-125">
<label>Figure 125</label>
<caption><title><italic>2-D Burgers&#x2019; equation. Solution snapshots of full and reduced-order models</italic> (Section <xref ref-type="sec" rid="s12_4_5">12.4.5</xref>). From left to right, the components <inline-formula id="ieqn-2012"><mml:math id="mml-ieqn-2012"><mml:mi>u</mml:mi></mml:math></inline-formula> (top row) and <inline-formula id="ieqn-2013"><mml:math id="mml-ieqn-2013"><mml:mi>v</mml:mi></mml:math></inline-formula> (bottom row) of the velocity field at time <inline-formula id="ieqn-2014"><mml:math id="mml-ieqn-2014"><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> are shown for the FOM, the hyper-reduced nonlinear-manifold-based ROM (NM-LSPG-HR) and the hyper-reduced linear-subspace-based ROM (LS-LSPG-HR). Both ROMs have a dimension of <inline-formula id="ieqn-2015"><mml:math id="mml-ieqn-2015"><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula>; with respect to hyper-reduction, the residual basis of the NM-LSPG-HR ROM has a dimension of <inline-formula id="ieqn-2016"><mml:math id="mml-ieqn-2016"><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mn>55</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-2017"><mml:math id="mml-ieqn-2017"><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="bold-italic">Z</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mn>58</mml:mn></mml:mrow></mml:math></inline-formula> components of the full-order residual (&quot;sampling indices&quot;) are used in the gappy reconstruction of the reduced residual. 
For the LS-LSPG-HR ROM, both the dimension of the residual basis and the number of sampling indices are <inline-formula id="ieqn-2018"><mml:math id="mml-ieqn-2018"><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mn>59</mml:mn></mml:mrow></mml:math></inline-formula>. Due to the advection, a steep, shock wave-like gradient develops. While FOM and NM-LSPG-HR solutions are visually indistinguishable, the LS-LSPG-HR fails to reproduce the FOM&#x2019;s solution by a large margin (right column). Spurious oscillation patterns characteristic of advection-dominated problems (Brooks &amp; Hughes (1982) [<xref ref-type="bibr" rid="ref-316">316</xref>]) occur. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-125.tif"/>
</fig>
</sec>
<sec id="s12_4_5"><label>12.4.5</label>
<title>Numerical example: 2D Burgers&#x2019; equation</title>
<p>As a second example, consider now Burgers&#x2019; equation in two (spatial) dimensions [<xref ref-type="bibr" rid="ref-47">47</xref>], instead of in one dimension as in Section <xref ref-type="sec" rid="s12_4_1">12.4.1</xref>. Additionally, <italic>viscous behavior</italic>, which manifests as the Laplace term on the right-hand side of the following equation, was included as opposed to the one-dimensional problem in Eq. (<xref ref-type="disp-formula" rid="eqn-442">442</xref>):</p>
<p><disp-formula id="eqn-483"><label>(483)</label><mml:math id="mml-eqn-483" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mtext>grad</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>R</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:mrow><mml:mtext>div</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mtext>grad</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo 
stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The problem was solved on a square (unit) domain <inline-formula id="ieqn-2662"><mml:math id="mml-ieqn-2662"><mml:mtext>&#x03A9;</mml:mtext></mml:math></inline-formula>, where homogeneous Dirichlet conditions for the velocity field <inline-formula id="ieqn-2663"><mml:math id="mml-ieqn-2663"><mml:mi mathvariant="bold-italic">u</mml:mi></mml:math></inline-formula> were assumed at all boundaries <inline-formula id="ieqn-2664"><mml:math id="mml-ieqn-2664"><mml:mtext>&#x0393;</mml:mtext><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mtext>&#x03A9;</mml:mtext></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-484"><label>(484)</label><mml:math id="mml-eqn-484" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mn mathvariant="bold">0</mml:mn></mml:mrow><mml:mspace width="1em" /><mml:mrow><mml:mtext>on</mml:mtext></mml:mrow><mml:mspace width="1em" /><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>An inhomogeneous flow profile oriented at an angle of 45&#x00B0; to the coordinate axes was prescribed as the initial condition (<inline-formula id="ieqn-2665"><mml:math id="mml-ieqn-2665"><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>),</p>
<p><disp-formula id="eqn-485"><label>(485)</label><mml:math id="mml-eqn-485" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mi>&#x03BC;</mml:mi><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>0.5</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>0.5</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" 
symmetric="true"></mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2666"><mml:math id="mml-ieqn-2666"><mml:mi>u</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-2667"><mml:math id="mml-ieqn-2667"><mml:mi>v</mml:mi></mml:math></inline-formula> denoted the Cartesian components of the velocity field, and <inline-formula id="ieqn-2668"><mml:math id="mml-ieqn-2668"><mml:mi>&#x03BC;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0.9</mml:mn><mml:mo>,</mml:mo><mml:mn>1.1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> was a parameter describing the magnitude of the initial velocity field. Viscosity was governed by the Reynolds number <inline-formula id="ieqn-2669"><mml:math id="mml-ieqn-2669"><mml:mi>R</mml:mi><mml:mi>e</mml:mi></mml:math></inline-formula>, i.e., the problem was advection-dominated for high Reynolds numbers, whereas diffusion prevailed for low Reynolds numbers.</p>
<fig id="fig-126">
<label>Figure 126</label>
<caption><title><italic>2-D Burgers&#x2019; equation. Reynolds number vs. singular values</italic> (Section <xref ref-type="sec" rid="s12_4_5">12.4.5</xref>). The influence of the Reynolds number on the singular values is illustrated by performing an SVD on FOM solution snapshots, which were partitioned into <inline-formula id="ieqn-2019"><mml:math id="mml-ieqn-2019"><mml:mi>x</mml:mi></mml:math></inline-formula>- and <inline-formula id="ieqn-2020"><mml:math id="mml-ieqn-2020"><mml:mi>y</mml:mi></mml:math></inline-formula>-components. In diffusion-dominated problems, which are characterized by a low Reynolds number, a rapid decay of singular values was observed. Fewer than 100 singular values were non-zero (in terms of double-precision accuracy) in the present example for <inline-formula id="ieqn-2021"><mml:math id="mml-ieqn-2021"><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn>100</mml:mn></mml:mrow></mml:math></inline-formula>. In problems with a high Reynolds number, in which advection dominates over diffusive processes, the decay of singular values was much slower. As many as 200 singular values were different from zero in the case <inline-formula id="ieqn-2022"><mml:math id="mml-ieqn-2022"><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mn>4</mml:mn></mml:msup></mml:math></inline-formula>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-126.tif"/>
</fig>
<p>The semi-discrete FOM was obtained by a finite-difference approximation in the spatial dimensions on a uniform <inline-formula id="ieqn-2670"><mml:math id="mml-ieqn-2670"><mml:mn>60</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>60</mml:mn></mml:math></inline-formula> grid of points <inline-formula id="ieqn-2671"><mml:math id="mml-ieqn-2671"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-2672"><mml:math id="mml-ieqn-2672"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>60</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-2673"><mml:math id="mml-ieqn-2673"><mml:mi>j</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>60</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. First spatial derivatives were approximated by backward differences; central differences were used to approximate second derivatives. The spatial discretization led to a set of ODEs, which was partitioned into two subsets that corresponded to the two spatial directions:</p>
<p><disp-formula id="eqn-486"><label>(486)</label><mml:math id="mml-eqn-486" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mover accent='true'><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi>u</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="2em"/><mml:mover accent='true'><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="2em"/><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2674"><mml:math id="mml-ieqn-2674"><mml:mi mathvariant="bold-italic">U</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-2675"><mml:math id="mml-ieqn-2675"><mml:mi mathvariant="bold-italic">V</mml:mi></mml:math></inline-formula> comprised the components of the velocity vectors at the grid points in the <inline-formula id="ieqn-2676"><mml:math id="mml-ieqn-2676"><mml:mi>x</mml:mi></mml:math></inline-formula>- and <inline-formula id="ieqn-2677"><mml:math id="mml-ieqn-2677"><mml:mi>y</mml:mi></mml:math></inline-formula>-directions, respectively. The nonlinear functions <inline-formula id="ieqn-2678"><mml:math id="mml-ieqn-2678"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi>u</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo>:</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> follow from the spatial discretization of the advection and diffusion terms [<xref ref-type="bibr" rid="ref-47">47</xref>]. In line with the partitioning of the system of equations (<xref ref-type="disp-formula" rid="eqn-486">486</xref>), two separate autoencoders were trained for <inline-formula id="ieqn-2679"><mml:math id="mml-ieqn-2679"><mml:mi mathvariant="bold-italic">U</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-2680"><mml:math id="mml-ieqn-2680"><mml:mi mathvariant="bold-italic">V</mml:mi></mml:math></inline-formula>, respectively, since this required less memory than a single autoencoder for the full set of unknowns <inline-formula id="ieqn-2681"><mml:math id="mml-ieqn-2681"><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">U</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">V</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>.</p>
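<p>As a minimal sketch of how the right-hand sides in Eq. (<xref ref-type="disp-formula" rid="eqn-486">486</xref>) may be evaluated, the following Python/NumPy fragment applies backward differences to the advection term and central differences to the diffusion term. The function names (burgers_rhs, initial_velocity), the grid spacing h = 1/(n - 1), and the zero padding used to impose Eq. (<xref ref-type="disp-formula" rid="eqn-484">484</xref>) are assumptions made for illustration; the fragment is not the implementation of [<xref ref-type="bibr" rid="ref-47">47</xref>], whose stencil details may differ.</p>
<preformat>
```python
import numpy as np


def burgers_rhs(U, V, Re, h):
    """Return (f_u, f_v) at the interior grid points.

    U, V: arrays of shape (nx - 2, ny - 2); axis 0 is x, axis 1 is y.
    """
    # Zero padding imposes the homogeneous Dirichlet condition on the boundary.
    Up, Vp = np.pad(U, 1), np.pad(V, 1)

    def backward(Fp):
        # First derivatives by backward differences: (F[i] - F[i-1]) / h.
        dx = (Fp[1:-1, 1:-1] - Fp[:-2, 1:-1]) / h
        dy = (Fp[1:-1, 1:-1] - Fp[1:-1, :-2]) / h
        return dx, dy

    def laplacian(Fp):
        # Second derivatives by central differences (five-point stencil).
        return (Fp[2:, 1:-1] + Fp[:-2, 1:-1] + Fp[1:-1, 2:] + Fp[1:-1, :-2]
                - 4.0 * Fp[1:-1, 1:-1]) / h**2

    Ux, Uy = backward(Up)
    Vx, Vy = backward(Vp)
    f_u = -(U * Ux + V * Uy) + laplacian(Up) / Re
    f_v = -(U * Vx + V * Vy) + laplacian(Vp) / Re
    return f_u, f_v


def initial_velocity(n, mu):
    """Interior values of the initial profile; u and v coincide at t = 0."""
    x = np.linspace(0.0, 1.0, n)[1:-1]
    X, Y = np.meshgrid(x, x, indexing="ij")
    in_patch = np.logical_and(np.less_equal(X, 0.5), np.less_equal(Y, 0.5))
    profile = mu * np.sin(2 * np.pi * X) * np.sin(2 * np.pi * Y)
    return np.where(in_patch, profile, 0.0)


n, Re = 60, 1.0e4
h = 1.0 / (n - 1)
U0 = initial_velocity(n, mu=1.0)
f_u, f_v = burgers_rhs(U0, U0.copy(), Re, h)
print(f_u.shape)  # (58, 58)
```
</preformat>
<p>Because the initial values of both velocity components coincide, the two right-hand sides returned for the initial state are identical, which mirrors the symmetry of the problem about the diagonal of the domain.</p>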
<p>For time integration, the backward Euler scheme with a constant step size <inline-formula id="ieqn-2682"><mml:math id="mml-ieqn-2682"><mml:mo>&#x0394;</mml:mo><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> was applied, where <inline-formula id="ieqn-2683"><mml:math id="mml-ieqn-2683"><mml:msub><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mn>1500</mml:mn></mml:mrow></mml:math></inline-formula> was used in the example. The solutions corresponding to the parameter values <inline-formula id="ieqn-2684"><mml:math id="mml-ieqn-2684"><mml:mi>&#x03BC;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>0.9</mml:mn><mml:mo>,</mml:mo><mml:mn>0.95</mml:mn><mml:mo>,</mml:mo><mml:mn>1.05</mml:mn><mml:mo>,</mml:mo><mml:mn>1.1</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> were collected as training data, which amounted to a total of <inline-formula id="ieqn-2685"><mml:math id="mml-ieqn-2685"><mml:mn>4</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mn>6004</mml:mn></mml:mrow></mml:math></inline-formula> snapshots. Ten percent (10%) of the snapshots were retained as a validation set (see <xref ref-type="sec" rid="s6_1">Sec. 6.1</xref>); no test set was used.</p>
<p>Figure <xref ref-type="fig" rid="fig-126">126</xref> shows the influence of the Reynolds number on the singular values obtained from solution snapshots of the FOM. For <inline-formula id="ieqn-2686"><mml:math id="mml-ieqn-2686"><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn>100</mml:mn></mml:mrow></mml:math></inline-formula>, i.e., when diffusion was dominant, the singular values decayed rapidly as compared to an advection-dominated problem with a high Reynolds number of <inline-formula id="ieqn-2687"><mml:math id="mml-ieqn-2687"><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mn>4</mml:mn></mml:msup></mml:math></inline-formula>, for which the reduced-order models were constructed in what follows. In other words, the dimensionality of the tangent space of the FOM&#x2019;s solution was more than twice as large in the advection-dominated case, which limited the feasible reduction in dimensionality by means of linear subspace methods. Note that the singular values were the same for both components of the velocity field <inline-formula id="ieqn-2688"><mml:math id="mml-ieqn-2688"><mml:mi>u</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-2689"><mml:math id="mml-ieqn-2689"><mml:mi>v</mml:mi></mml:math></inline-formula>. 
The problem was symmetric about the diagonal from the lower-left (<inline-formula id="ieqn-2690"><mml:math id="mml-ieqn-2690"><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-2691"><mml:math id="mml-ieqn-2691"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>) to the upper-right (<inline-formula id="ieqn-2692"><mml:math id="mml-ieqn-2692"><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-2693"><mml:math id="mml-ieqn-2693"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>) corner of the domain, which was also reflected in the solution snapshots illustrated in Figure <xref ref-type="fig" rid="fig-125">125</xref>. For this reason, we only show the results related to the <inline-formula id="ieqn-2694"><mml:math id="mml-ieqn-2694"><mml:mi>x</mml:mi></mml:math></inline-formula>-component of the velocity field in what follows.</p>
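<p>The link between advection dominance and slow singular-value decay can be illustrated with a small synthetic experiment. The sketch below, which assumes Python with NumPy and uses made-up snapshot data (two decaying standing modes versus a narrow travelling pulse) rather than the FOM snapshots of [<xref ref-type="bibr" rid="ref-47">47</xref>], counts the singular values that are non-zero in terms of double-precision accuracy; the helper name numerical_rank is hypothetical.</p>
<preformat>
```python
import numpy as np


def numerical_rank(snapshots):
    """Return (rank, singular values) using a double-precision tolerance."""
    s = np.linalg.svd(snapshots, compute_uv=False)
    tol = s[0] * max(snapshots.shape) * np.finfo(float).eps
    return int(np.count_nonzero(np.greater(s, tol))), s


x = np.linspace(0.0, 1.0, 200)[:, None]   # spatial points (rows)
t = np.linspace(0.0, 1.0, 100)[None, :]   # time instants (columns)

# Diffusion-like data: two decaying standing modes.
diffusive = (np.exp(-t) * np.sin(np.pi * x)
             + 0.5 * np.exp(-4.0 * t) * np.sin(2.0 * np.pi * x))
# Advection-like data: a narrow pulse travelling across the domain.
advective = np.exp(-((x - 0.2 - 0.6 * t) / 0.02) ** 2)

rank_diff, _ = numerical_rank(diffusive)
rank_adv, _ = numerical_rank(advective)
print(rank_diff, rank_adv)
```
</preformat>
<p>The travelling pulse requires far more linear modes than the decaying standing modes, which is the same qualitative effect that limits linear-subspace methods in the advection-dominated case.</p>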
<p>Both autoencoders (for <inline-formula id="ieqn-2695"><mml:math id="mml-ieqn-2695"><mml:mi mathvariant="bold-italic">U</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-2696"><mml:math id="mml-ieqn-2696"><mml:mi mathvariant="bold-italic">V</mml:mi></mml:math></inline-formula>, respectively) had the same structure. In each encoder, the (single) hidden layer had a width of 6728 neurons, which were referred to as &#x201C;nodes&#x201D; in [<xref ref-type="bibr" rid="ref-47">47</xref>]; with 33730 neurons, the hidden layer of the (sparse) decoder networks was about five times wider. To train the autoencoders, the Adam algorithm (see Sec. <xref ref-type="sec" rid="s6_5_6">6.5.6</xref> and Algorithm <xref ref-type="fig" rid="fig-163">5</xref>) was used with an initial learning rate of 0.001. The learning rate was decreased by a factor of 10 when the training loss did not decrease for 10 successive epochs. The batch size was 240, and training was stopped either after the maximum of 10000 epochs or, alternatively, once the validation loss had not decreased for 200 epochs. The parameters of all neural networks were initialized according to the <italic>Kaiming He</italic> initialization [<xref ref-type="bibr" rid="ref-61">61</xref>].</p>
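<p>The training controls described above (learning-rate reduction on a plateau of the training loss, early stopping on a plateau of the validation loss, and a cap on the number of epochs) can be mimicked by the following framework-free Python sketch, which replays recorded loss values. The names PlateauTracker and run_schedule are hypothetical, and resetting the plateau counter after each learning-rate reduction is an assumption, since the exact semantics used in [<xref ref-type="bibr" rid="ref-47">47</xref>] are not stated here.</p>
<preformat>
```python
class PlateauTracker:
    """Count the epochs since a monitored loss last decreased."""

    def __init__(self):
        self.best = float("inf")
        self.stale = 0

    def update(self, loss):
        if self.best > loss:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
        return self.stale


def run_schedule(train_losses, val_losses, lr=1e-3, factor=0.1,
                 lr_patience=10, stop_patience=200, max_epochs=10000):
    """Replay per-epoch losses; return (epochs_run, final_learning_rate)."""
    train_tracker, val_tracker = PlateauTracker(), PlateauTracker()
    epochs = 0
    for train_loss, val_loss in zip(train_losses, val_losses):
        if epochs == max_epochs:
            break
        epochs += 1
        # Reduce the learning rate by a factor of 10 on a training plateau.
        if train_tracker.update(train_loss) >= lr_patience:
            lr *= factor
            train_tracker.stale = 0  # assumption: counter restarts afterwards
        # Stop early once the validation loss has stalled long enough.
        if val_tracker.update(val_loss) >= stop_patience:
            break
    return epochs, lr


# Losses that never improve after the first epoch trigger early stopping.
epochs_run, final_lr = run_schedule([1.0] * 1000, [1.0] * 1000)
print(epochs_run)  # 201
```
</preformat>
<p>With losses that stall immediately, the first epoch sets the reference value and the following 200 epochs without improvement trigger the stopping criterion, hence 201 epochs in total.</p>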
<p>To evaluate the accuracy of the NM-ROMs proposed in [<xref ref-type="bibr" rid="ref-47">47</xref>], Burgers&#x2019; equation was solved for the target parameter <inline-formula id="ieqn-2697"><mml:math id="mml-ieqn-2697"><mml:mi>&#x03BC;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, for which no solution snapshots were included in the training data. Figure <xref ref-type="fig" rid="fig-127">127</xref> compares the relative errors (Eq. (<xref ref-type="disp-formula" rid="eqn-444">444</xref>)) of the nonlinear-manifold-based and linear-projection-based ROMs as a function of the reduced dimension <inline-formula id="ieqn-2698"><mml:math id="mml-ieqn-2698"><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula>. Irrespective of whether a Galerkin or Petrov-Galerkin approach was used, nonlinear-manifold-based ROMs (NM-Galerkin, NM-LSPG) were superior to their linear-subspace counterparts (LS-Galerkin, LS-LSPG). The figure also shows the so-called <italic>projection errors</italic>, which are lower bounds for the relative errors of linear-subspace and nonlinear-manifold-based ROMs; see [<xref ref-type="bibr" rid="ref-47">47</xref>] for their definitions. Note that the relative error of the NM-LSPG ROM was smaller than the lower error bound of linear-subspace ROMs. As noted in [<xref ref-type="bibr" rid="ref-47">47</xref>], linear-subspace ROMs performed relatively poorly for the problem at hand, and even failed to converge. Both the LS-Galerkin and the LS-LSPG ROM showed relative errors of 1 if their dimension was 10 or more in the present problem. The NM-Galerkin ROM fell behind the NM-LSPG ROM in terms of accuracy. We also note that both NM-based ROMs hardly showed any reduction in error if their dimension was increased beyond five.</p>
<p>The authors of [<xref ref-type="bibr" rid="ref-47">47</xref>] also studied how the size of the parameter set <inline-formula id="ieqn-2699"><mml:math id="mml-ieqn-2699"><mml:msub><mml:mi>&#x1D49F;</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, which translated into the amount of training data, affected the accuracy of ROMs. For this purpose, parameter sets with <inline-formula id="ieqn-2700"><mml:math id="mml-ieqn-2700"><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>4</mml:mn><mml:mo>,</mml:mo><mml:mn>6</mml:mn><mml:mo>,</mml:mo><mml:mn>8</mml:mn></mml:math></inline-formula> values of <inline-formula id="ieqn-2701"><mml:math id="mml-ieqn-2701"><mml:mi>&#x03BC;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow></mml:math></inline-formula>, which were referred to as &#x201C;parameter instances,&#x201D; were created, where the target value <inline-formula id="ieqn-2702"><mml:math id="mml-ieqn-2702"><mml:mi>&#x03BC;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> remained excluded, i.e.,</p>
<fig id="fig-127">
<label>Figure 127</label>
<caption><title><italic>2-D Burgers&#x2019; equation. Relative errors of nonlinear-manifold and linear-subspace ROMs</italic> (Section <xref ref-type="sec" rid="s12_4_5">12.4.5</xref>). (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-127.tif"/>
</fig>
<table-wrap id="table-7"><label>Table 7</label>
<caption>
<p><italic>2-D Burgers&#x2019; equation. Juxtaposition of hyper-reduced ROMs: speed-up and accuracy</italic> (Section <xref ref-type="sec" rid="s12_4_5">12.4.5</xref>). The six best least-squares Petrov-Galerkin ROMs built upon nonlinear-manifold approximation by means of autoencoders and upon linear subspaces, respectively, are compared, where &#x2018;best&#x2019; refers to the maximum error relative to the FOM (Eq. (<xref ref-type="disp-formula" rid="eqn-444">444</xref>)). The optimal dimension of the basis <inline-formula id="ieqn-428"><mml:math id="mml-ieqn-428"><mml:msub><mml:mi>&#x03A6;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula> and the number of sampling indices <italic>p<sub>j</sub></italic> used in the gappy reconstruction of the nonlinear residual lie in similar ranges for all ROMs listed. The hyper-reduced ROMs achieve speed-ups in wall-clock time ranging from a factor of approximately 11 for the nonlinear-manifold-based approach to a factor of almost 30 for linear-subspace ROMs. While the former show a maximum relative error below 1 %, the latter fail to reproduce the FOM&#x2019;s behavior by a large margin. (Table reproduced with permission of the authors.)</p></caption>
<table>
<colgroup>
<col/>
</colgroup>
<tbody>
<tr>
<td align="center"><graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-167.tif"/></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><disp-formula id="eqn-487"><label>(487)</label><mml:math id="mml-eqn-487" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x1D49F;</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>0.9</mml:mn><mml:mo>+</mml:mo><mml:mn>0.2</mml:mn><mml:mi>i</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2216;</mml:mi><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>;</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>the reduced dimension was set to <inline-formula id="ieqn-2703"><mml:math id="mml-ieqn-2703"><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula>. Figure <xref ref-type="fig" rid="fig-127">127</xref> (right) reveals that, for the NM-LSPG ROM, 4 &#x201C;parameter instances&#x201D; were sufficient to reduce the maximum relative error below 1 % in the present problem. None of the ROMs benefited from a further enlargement of the parameter set for which the training data were generated.</p>
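<p>The parameter sets defined by Eq. (<xref ref-type="disp-formula" rid="eqn-487">487</xref>) and the resulting number of snapshots can be enumerated with a few lines of Python. The following is a minimal sketch under the assumption that each parameter instance contributes one snapshot per time instant, including the initial condition; the function name training_parameters and its keyword arguments are hypothetical.</p>
<preformat>
```python
import math


def training_parameters(n_train, lower=0.9, width=0.2, target=1.0):
    """Evaluate Eq. (487): a uniform grid on [lower, lower + width] minus the target."""
    grid = [lower + width * i / n_train for i in range(n_train + 1)]
    # Compare with a tolerance, since the grid values are floating-point numbers.
    return [mu for mu in grid if not math.isclose(mu, target, abs_tol=1e-12)]


n_t = 1500  # number of time steps used in the example
for n_train in (2, 4, 6, 8):
    params = training_parameters(n_train)
    n_snapshots = len(params) * (n_t + 1)
    print(n_train, [round(mu, 4) for mu in params], n_snapshots)
```
</preformat>
<p>For four parameter instances, the sketch returns the values 0.9, 0.95, 1.05, and 1.1 and a total of 6004 snapshots, in agreement with the numbers quoted above. Since the grid contains the target value only for an even number of subdivisions, the construction yields exactly the requested number of instances for the even values considered here.</p>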
<p>Hyper-reduction turned out to be crucial with respect to computational efficiency. For a reduced dimension of <inline-formula id="ieqn-2704"><mml:math id="mml-ieqn-2704"><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula>, all ROMs except for the NM-LSPG ROM, which achieved a minor speedup, were less efficient than the FOM in terms of wall-clock time. The dimension of the residual basis <inline-formula id="ieqn-2705"><mml:math id="mml-ieqn-2705"><mml:msub><mml:mi>n</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula> and the number of sampling indices <inline-formula id="ieqn-2706"><mml:math id="mml-ieqn-2706"><mml:msub><mml:mi>n</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:math></inline-formula> were both varied in the range from 40 to 60 to quantify their relation to the maximum relative error. For this purpose, the number of training instances was again set to <inline-formula id="ieqn-2707"><mml:math id="mml-ieqn-2707"><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> and the reduced dimension was fixed to <inline-formula id="ieqn-2708"><mml:math id="mml-ieqn-2708"><mml:msub><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula>. Table <xref ref-type="table" rid="table-7">7</xref> compares the 6 best&#x2013;in terms of maximum error relative to the FOM&#x2013;hyper-reduced least-squares Petrov-Galerkin ROMs based on nonlinear manifolds and linear subspaces, respectively. The NM-LSPG-HR ROM in [<xref ref-type="bibr" rid="ref-47">47</xref>] was able to achieve a speed-up of more than a factor of 11 while keeping the maximum relative error below 1 %. 
Though the speed-up of the linear-subspace counterpart was more than twice as large, relative errors beyond 34 % rendered these ROMs worthless.</p>
<fig id="fig-128">
<label>Figure 128</label>
<caption><title><italic>Machine-learning accelerated CFD</italic> (Section <xref ref-type="sec" rid="s12_4_5">12.4.5</xref>). The speed-up factor relative to direct integration was much higher than the factors obtained from nonlinear model-order reduction in Table <xref ref-type="table" rid="table-7">7</xref> [<xref ref-type="bibr" rid="ref-317">317</xref>]. <ext-link ext-link-type="uri" xlink:href="https://www.pnas.org/page/about/rights-permissions">Permission of NAS</ext-link>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-128.tif"/>
</fig><fig id="fig-129">
<label>Figure 129</label>
<caption><title><italic>Machine-learning accelerated CFD</italic> (Section <xref ref-type="sec" rid="s12_4_5">12.4.5</xref>). Good accuracy and good generalization, devoid of non-physical solutions [<xref ref-type="bibr" rid="ref-317">317</xref>]. <ext-link ext-link-type="uri" xlink:href="https://www.pnas.org/page/about/rights-permissions">Permission of NAS</ext-link>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-129.tif"/>
</fig>
<fig id="fig-130">
<label>Figure 130</label>
<caption><title><italic>Machine-learning accelerated CFD</italic> (Section <xref ref-type="sec" rid="s12_4_5">12.4.5</xref>). The neural network generates interpolation coefficients based on local-flow properties, while ensuring at least first-order accuracy relative to the grid spacing [<xref ref-type="bibr" rid="ref-317">317</xref>]. <ext-link ext-link-type="uri" xlink:href="https://www.pnas.org/page/about/rights-permissions">Permission of NAS</ext-link>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-130.tif"/>
</fig>
<statement id="st12_8"><title>Remark 12.8.</title>
<p><italic>Machine-learning accelerated CFD</italic>. A hybrid method between traditional direct integration of the Navier-Stokes equations and machine learning (ML) interpolation was presented in [<xref ref-type="bibr" rid="ref-317">317</xref>] (Figure <xref ref-type="fig" rid="fig-130">130</xref>), where a speed-up factor close to 90, many times higher than those in Table <xref ref-type="table" rid="table-7">7</xref>, was obtained (Figure <xref ref-type="fig" rid="fig-128">128</xref>), while generalizing well (Figure <xref ref-type="fig" rid="fig-129">129</xref>). Grounded in traditional direct integration, such a hybrid method would avoid the non-physical solutions of pure machine-learning methods, such as physics-inspired machine learning (Section <xref ref-type="sec" rid="s9_5">9.5</xref>, Remark <xref ref-type="statement" rid="st9_4">9.4</xref>), maintain the higher accuracy obtained with direct integration, and at the same time benefit from the acceleration provided by the learned interpolation.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<statement id="st12_9"><title>Remark 12.9.</title>
<p>In concluding this section, we mention the 2023 review paper [<xref ref-type="bibr" rid="ref-318">318</xref>], brought to our attention by a reviewer, on &#x201C;A state-of-the-art review on machine learning-based multiscale modeling, simulation, homogenization and design of materials.&#x201D; This review paper would nicely complement our present review paper.&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement></sec></sec></sec>
<sec id="s13"><label>13</label>
<title>Historical perspective</title>
<sec id="s13_1"><label>13.1</label>
<title>Early inspiration from biological neurons</title>
<p>In the early days, many papers on artificial neural networks, particularly for applications in engineering, began by motivating readers with a figure of a biological neuron as in Figure <xref ref-type="fig" rid="fig-131">131</xref> (see, e.g., [<xref ref-type="bibr" rid="ref-23">23</xref>], Figure 1a), before displaying an artificial neuron (e.g., [<xref ref-type="bibr" rid="ref-23">23</xref>], Figure 1b). Once artificial neural networks had gained a foothold in the research community, there was no longer a need to motivate with a biological neuron; see, e.g., [<xref ref-type="bibr" rid="ref-38">38</xref>] [<xref ref-type="bibr" rid="ref-20">20</xref>], which began directly with an artificial neuron.</p>
<fig id="fig-131">
<label>Figure 131</label>
<caption><title><italic>Biological Neuron and signal flow</italic> (Sections <xref ref-type="sec" rid="s4_4_4">4.4.4</xref>, <xref ref-type="sec" rid="s13_1">13.1</xref>, <xref ref-type="sec" rid="s13_2_2">13.2.2</xref>) along a myelinated axon, with inputs at the synapses (input points) in the dendrites and with outputs at the axon terminals (output points, which are also the synapses for the next neuron). Each input current <inline-formula id="ieqn-2023"><mml:math id="mml-ieqn-2023"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is multiplied by the weight <inline-formula id="ieqn-2024"><mml:math id="mml-ieqn-2024"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, then all weighted input currents are summed together (linear combination), with <inline-formula id="ieqn-2025"><mml:math id="mml-ieqn-2025"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, to form the total synaptic input current <inline-formula id="ieqn-2026"><mml:math id="mml-ieqn-2026"><mml:msub><mml:mi>I</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> into the soma (cell body). The corresponding artificial neuron is in Figure <xref ref-type="fig" rid="fig-36">36</xref> in Section <xref ref-type="sec" rid="s4_4_4">4.4.4</xref>. (Figure adapted from Wikipedia <ext-link ext-link-type="uri" xlink:href="https://commons.wikimedia.org/w/index.php?title=File:Neuron3.svg&amp;oldid=348383690.tif">version 14:29, 2 May 2019</ext-link>).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-131.tif"/>
</fig>
</sec>
<sec id="s13_2"><label>13.2</label>
<title>Spatial / temporal combination of inputs, weights, biases</title>
<p>Both [<xref ref-type="bibr" rid="ref-21">21</xref>] and [<xref ref-type="bibr" rid="ref-78">78</xref>] referred to Rosenblatt (1958) [<xref ref-type="bibr" rid="ref-119">119</xref>], who first proposed using a linear combination of inputs with weights, and with biases (thresholds). The authors of [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 14, only mentioned the &#x201C;linear model&#x201D; defined&#x2014;using the notation convention in Eq. (<xref ref-type="disp-formula" rid="eqn-16">16</xref>)&#x2014;as</p>
<p><disp-formula id="eqn-488"><label>(488)</label><mml:math id="mml-eqn-488" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>without the bias. On the other hand, it was written in [<xref ref-type="bibr" rid="ref-21">21</xref>] that &#x201C;Rosenblatt proposed a simple rule to compute the output. He introduced weights, real numbers expressing the importance of the respective inputs to the output&#x201D; and &#x201C;some threshold value,&#x201D; and attributed the following equation to Rosenblatt</p>
<p><disp-formula id="eqn-489"><label>(489)</label><mml:math id="mml-eqn-489" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>output</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mrow><mml:mtext>&#xA0;if&#xA0;</mml:mtext></mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mtext>threshold</mml:mtext></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#xA0;if&#xA0;</mml:mtext></mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&gt;</mml:mo><mml:mrow><mml:mtext>threshold</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the threshold is simply the negative of the bias <inline-formula id="ieqn-2710"><mml:math id="mml-ieqn-2710"><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>)<xref ref-type="fn" rid="fn293"><sup>293</sup></xref><fn id="fn293"><label>293</label><p>&#x201C;The Perceptron&#x2019;s design was much like that of the modern neural net, except that it had only one layer with adjustable weights and thresholds, sandwiched between input and output layers&#x201D; [<xref ref-type="bibr" rid="ref-77">77</xref>]. In the neuroscientific terminology that Rosenblatt (1958) [<xref ref-type="bibr" rid="ref-119">119</xref>] used, the input layer contains the sensory units, the middle (hidden) layer contains the &#x201C;association units,&#x201D; and the output layer contains the response units. Due to the difference in notation and due to &#x201C;neurodynamics&#x201D; as a new field for most readers, we provide here some markers that could help track down where Rosenblatt used linear combination of the inputs. Rosenblatt (1962) [<xref ref-type="bibr" rid="ref-2">2</xref>], p. 
82, defined the &#x201C;transmission function <inline-formula id="ieqn-3327"><mml:math id="mml-ieqn-3327"><mml:msubsup><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula>&#x201D; for the connection between two &#x201C;units&#x201D; (neurons) <inline-formula id="ieqn-3328"><mml:math id="mml-ieqn-3328"><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-3329"><mml:math id="mml-ieqn-3329"><mml:msub><mml:mi>u</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula>, with <inline-formula id="ieqn-3330"><mml:math id="mml-ieqn-3330"><mml:msubsup><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula> playing the same role as that of the term <inline-formula id="ieqn-3331"><mml:math id="mml-ieqn-3331"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> in <inline-formula id="ieqn-3332"><mml:math id="mml-ieqn-3332"><mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>). 
Then for an &#x201C;elementary perceptron&#x201D;, the transmission function <inline-formula id="ieqn-3333"><mml:math id="mml-ieqn-3333"><mml:msubsup><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula> was defined in [<xref ref-type="bibr" rid="ref-2">2</xref>], p. 85, to be equal to the output of unit <inline-formula id="ieqn-3334"><mml:math id="mml-ieqn-3334"><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> (equivalent to <inline-formula id="ieqn-3335"><mml:math id="mml-ieqn-3335"><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>) multiplied by the &#x201C;coupling coefficient&#x201D; <inline-formula id="ieqn-3336"><mml:math id="mml-ieqn-3336"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> (between unit <inline-formula id="ieqn-3337"><mml:math id="mml-ieqn-3337"><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and unit <inline-formula id="ieqn-3338"><mml:math id="mml-ieqn-3338"><mml:msub><mml:mi>u</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula>), with <inline-formula id="ieqn-3339"><mml:math id="mml-ieqn-3339"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> being the equivalent of the weight <inline-formula id="ieqn-3340"><mml:math id="mml-ieqn-3340"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> in Eq. 
(<xref ref-type="disp-formula" rid="eqn-26">26</xref>), ignoring the time dependence. The word &#x201C;weight,&#x201D; meaning coefficient, was not used often in [<xref ref-type="bibr" rid="ref-2">2</xref>], and not at all in [<xref ref-type="bibr" rid="ref-119">119</xref>].</p></fn> or <inline-formula id="ieqn-2711"><mml:math id="mml-ieqn-2711"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-132">132</xref>, which is a graphical representation of Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>). The author of [<xref ref-type="bibr" rid="ref-21">21</xref>] went on to say, &#x201C;That&#x2019;s all there is to how a perceptron works!&#x201D; Such a statement could be highly misleading for first-time learners in discounting Rosenblatt&#x2019;s important contributions, which were extensively inspired by neuroscience, and which comprised not only the perceptron as a machine-learning algorithm, but also the development of the Mark I computer, a hardware implementation of the perceptron; see Figure <xref ref-type="fig" rid="fig-133">133</xref> [<xref ref-type="bibr" rid="ref-319">319</xref>] [<xref ref-type="bibr" rid="ref-120">120</xref>] [<xref ref-type="bibr" rid="ref-320">320</xref>].</p>
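<p>As a minimal sketch of Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>), and not a reproduction of any code from [<xref ref-type="bibr" rid="ref-21">21</xref>] or from Rosenblatt, the following Python fragment evaluates a single threshold unit in two equivalent forms: once with an explicit threshold, and once with the bias taken as the negative of the threshold. The function names, the weights, and the threshold value 1.5 are hypothetical, chosen only so that the unit reproduces the logical AND of two binary inputs.</p>
<code language="python">
```python
import numpy as np

def perceptron_output(x, w, threshold):
    """Eq. (489): output 1 if the weighted sum exceeds the threshold, else 0."""
    return int(np.dot(w, x) > threshold)

def perceptron_output_bias(x, w, b):
    """Same unit written with a bias b = -threshold, compared against zero."""
    z = np.dot(w, x) + b
    return int(z > 0)

# Hypothetical weights and threshold (assumption): the unit acts as a logical AND.
w = np.array([1.0, 1.0])
threshold = 1.5
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

outputs = [perceptron_output(np.array(x), w, threshold) for x in inputs]
outputs_bias = [perceptron_output_bias(np.array(x), w, -threshold) for x in inputs]
print(outputs)       # [0, 0, 0, 1]
print(outputs_bias)  # [0, 0, 0, 1]
```
</code>
<p>Both forms return the same outputs, since comparing the weighted sum with the threshold is the same as comparing the weighted sum plus the bias with zero. Note that Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>) assigns the output 0 when the weighted sum is exactly equal to the threshold.</p>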
<fig id="fig-132">
<label>Figure 132</label>
<caption><title>The <italic>perceptron network</italic> (Sections <xref ref-type="sec" rid="s4_5">4.5</xref>, <xref ref-type="sec" rid="s13_2">13.2</xref>)&#x2014;introduced by Rosenblatt (1958) [<xref ref-type="bibr" rid="ref-119">119</xref>], (1962) [<xref ref-type="bibr" rid="ref-120">120</xref>]&#x2014;has a linear combination with weights and bias as expressed in <inline-formula id="ieqn-2027"><mml:math id="mml-ieqn-2027"><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">w</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:math></inline-formula>, but differs from the one-layer network in Figure <xref ref-type="fig" rid="fig-37">37</xref> in that it is activated by the Heaviside function. The Rosenblatt perceptron cannot represent the XOR function; see Section <xref ref-type="sec" rid="s4_5">4.5</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-132.tif"/>
</fig>
<p>Moreover, adding to the confusion for first-time learners, another erroneous and misleading statement about the &#x201C;Rosenblatt perceptron&#x201D; in connection with Eq. (<xref ref-type="disp-formula" rid="eqn-488">488</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>)&#x2014;which represent a single neuron&#x2014;is in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 13, where it was stated that the &#x201C;Rosenblatt perceptron&#x201D; involved only a &#x201C;single neuron&#x201D;:</p>
<disp-quote><p>&#x201C;The first wave started with cybernetics in the 1940s-1960s, with the development of theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of the first models, such as the perceptron (Rosenblatt, 1958), enabling the training of a single neuron.&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 13.</p>
</disp-quote><p>The error of considering the Rosenblatt perceptron as having a &#x201C;single neuron&#x201D; is also reported in Figure <xref ref-type="fig" rid="fig-42">42</xref>, which is Figure 1.11 in [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 23. But the Rosenblatt perceptron as described in the cited reference Rosenblatt (1958) [<xref ref-type="bibr" rid="ref-119">119</xref>] and in Rosenblatt (1960) [<xref ref-type="bibr" rid="ref-319">319</xref>] was a network, called a &#x201C;nerve net&#x201D;:</p>
<disp-quote><p>&#x201C;Any perceptron, or nerve net, consists of a network of &#x201C;cells,&#x201D; or signal generating units, and connections between them.&#x201D;</p>
</disp-quote>
<fig id="fig-133">
<label>Figure 133</label>
<caption><title><italic>Rosenblatt and the Mark I computer</italic> (Sections <xref ref-type="sec" rid="s4_6_1">4.6.1</xref>, <xref ref-type="sec" rid="s13_2">13.2</xref>) based on the perceptron, described in the New York Times article titled &#x201C;New Navy device learns by doing&#x201D; on 1958 July 8 (<ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20190912024703/http://jcblackmon.com/wp-content/uploads/2018/01/MBC-Rosenblatt-Perceptron-NYT-article.jpg.pdf">Internet archive</ext-link>), as a &#x201C;computer designed to read and grow wiser&#x201D;, and would be able to &#x201C;walk, talk, see, write, reproduce itself and be conscious of its existence. The first perceptron will have about 1,000 electronic &#x201C;association cells&#x201D; [A-units] receiving electrical impulses from an eye-like scanning device with 400 photo cells&#x201D;. See also the <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=cNxadbrN_al">Youtube video</ext-link> &#x201C;Perceptron Research from the 50&#x2019;s &#x0026; 60&#x2019;s, clip&#x201D;. Sometimes, it is incorrectly thought that Rosenblatt&#x2019;s network had only one neuron (A-unit); see Figure <xref ref-type="fig" rid="fig-42">42</xref>, Section <xref ref-type="sec" rid="s4_6_1">4.6.1</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-133.tif"/>
</fig>
<p>Such a &#x201C;nerve net&#x201D; would surely not just contain a &#x201C;single neuron&#x201D;. Indeed, the report by Rosenblatt (1957) [<xref ref-type="bibr" rid="ref-1">1</xref>] that appeared a year earlier mentioned a network (with one layer) containing as many as a thousand neurons, called &#x201C;association cells&#x201D; (or A-units):<xref ref-type="fn" rid="fn294"><sup>294</sup></xref><fn id="fn294"><label>294</label><p>We were so thoroughly misled in thinking that the &#x201C;Rosenblatt perceptron&#x201D; was a single neuron that we were surprised to learn that Rosenblatt had built the Mark I computer with many neurons.</p></fn></p>
<disp-quote><p>&#x201C;Thus with 1000 A-units connected to each R-unit [response unit or output], and a system in which 1% of the A-units respond to stimuli of a given size (i.e., <inline-formula id="ieqn-2712"><mml:math id="mml-ieqn-2712"><mml:msub><mml:mi>P</mml:mi><mml:mi>a</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>.01</mml:mn></mml:math></inline-formula>), the probability of making a correct discrimination with one unit of training, after <inline-formula id="ieqn-2713"><mml:math id="mml-ieqn-2713"><mml:msup><mml:mn>10</mml:mn><mml:mn>6</mml:mn></mml:msup></mml:math></inline-formula> stimuli have been associated to each response in the system, is equal to the 2.23 sigma level, or a probability of 0.987 of being correct.&#x201D; [<xref ref-type="bibr" rid="ref-1">1</xref>], p. 16.</p>
</disp-quote><p>The perceptron with one thousand A-units mentioned in [<xref ref-type="bibr" rid="ref-1">1</xref>] was also reported in the New York Times article &#x201C;New Navy device learns by doing&#x201D; on 1958 July 8 (<ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20190912024703/http://jcblackmon.com/wp-content/uploads/2018/01/MBC-Rosenblatt-Perceptron-NYT-article.jpg.pdf">Internet archive</ext-link>); see Figure <xref ref-type="fig" rid="fig-133">133</xref>. Even if the report by Rosenblatt (1957) [<xref ref-type="bibr" rid="ref-1">1</xref>] were not immediately accessible, it was stated in no uncertain terms that the perceptron was a machine with many neurons:</p>
<disp-quote><p>&#x201C;The organization of a typical photo-perceptron (a perceptron responding to optical patterns as stimuli) is shown in Figure 1. ... [Rule] 1. Stimuli impinge on a retina of sensory units (S-points), which are assumed to respond on an all-or-nothing basis<xref ref-type="fn" rid="fn295"><sup>295</sup></xref><fn id="fn295"><label>295</label><p>Heaviside activation function, see Figure <xref ref-type="fig" rid="fig-132">132</xref> for the case of one neuron.</p></fn>. [Rule] 2. Impulses are transmitted to a set of association cells (A-units) [neurons]... If the algebraic sum of excitatory and inhibitory impulse intensities<xref ref-type="fn" rid="fn296"><sup>296</sup></xref><fn id="fn296"><label>296</label><p>Weighted sum / voting (or linear combination) of inputs; see Eq. (<xref ref-type="disp-formula" rid="eqn-488">488</xref>), Eq. (<xref ref-type="disp-formula" rid="eqn-493">493</xref>), Eq. (<xref ref-type="disp-formula" rid="eqn-494">494</xref>).</p></fn> is equal to or greater than the threshold<xref ref-type="fn" rid="fn297"><sup>297</sup></xref><fn id="fn297"><label>297</label><p>The negative of the bias <inline-formula id="ieqn-3341"><mml:math id="mml-ieqn-3341"><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>).</p></fn> (<italic>&#x03B8;</italic>) of the A-unit, then the A-unit fires, again on an all-or-nothing basis.&#x201D; [<xref ref-type="bibr" rid="ref-119">119</xref>]</p>
</disp-quote><p>Figure 1 in [<xref ref-type="bibr" rid="ref-119">119</xref>] described a network (&#x201C;nerve net&#x201D;) with many A-units (neurons). Does anyone still read the classics anymore?</p>
<p>Rosenblatt&#x2019;s (1962) book [<xref ref-type="bibr" rid="ref-2">2</xref>], p. 33, provided the following neuroscientific explanation for using a linear combination (weighted sum / voting) of the inputs in both time and space:</p>
<disp-quote><p>&#x201C;The arrival of a single (excitatory) impulse gives rise to a partial depolarization of the post-synaptic<xref ref-type="fn" rid="fn298"><sup>298</sup></xref><fn id="fn298"><label>298</label><p>Refer to Figure <xref ref-type="fig" rid="fig-131">131</xref>. A <italic>synapse</italic> (meaning &#x201C;junction&#x201D;) is &#x201C;a structure that permits a neuron (or nerve cell) to pass an electrical or chemical signal to another neuron&#x201D;, and consists of three parts: the <italic>presynaptic</italic> part (which is an axon terminal of an upstream neuron from which the signal came), the gap called the synaptic cleft, and the <italic>postsynaptic</italic> part, located on a dendrite or on the neuron cell body (called the <italic>soma</italic>); [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 6; &#x201C;Synapse&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Synapse&amp;oldid=885986208">version 16:33, 3 March 2019</ext-link>. A <italic>dendrite</italic> is a conduit for transmitting the electrochemical signal received from another neuron, and passing through a synapse located on that dendrite; &#x201C;Dendrite&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Dendrite&amp;oldid=892648749">version 23:39, 15 April 2019</ext-link>. A synapse is thus an input point to a neuron in a biological neural network. An axon, or nerve fiber, is a &#x201C;long, slender projection of a neuron, that conducts electrical impulses known as action potentials&#x201D; away from the soma to the axon terminals, which are the presynaptic parts; &#x201C;Axon terminal&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Axon_terminal&amp;oldid=885382347">version 18:13, 27 February 2019</ext-link>.</p></fn> membrane surface, which spreads over an appreciable area, and decays exponentially with time. 
This is called a local excitatory state (l.e.s.). The l.e.s. due to successive impulses is (approximately) additive. Several impulses arriving in sufficiently close succession may thus combine to touch off an impulse in the receiving neuron if the local excitatory state at the base of the axon achieves the threshold level. This phenomenon is called <underline>temporal summation</underline>. Similarly, impulses which arrive at different points on the cell body or on the dendrites may combine by <underline>spatial summation</underline> to trigger an impulse if the l.e.s. induced at the base of the axon is strong enough.&#x201D;</p>
</disp-quote><p>The <italic>spatial summation</italic> of the input synaptic currents is also consistent with Kirchhoff&#x2019;s current law of summing the electrical currents at a junction in an electrical network.<xref ref-type="fn" rid="fn299"><sup>299</sup></xref><fn id="fn299"><label>299</label><p>See &#x201C;Kirchhoff&#x2019;s circuit laws&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Kirchhoff%27s_circuit_laws&amp;oldid=895023220">version 14:24, 1 May 2019</ext-link>.</p></fn> We first look at linear combination in the static case, followed by the dynamic case with Volterra series.</p>
<sec id="s13_2_1"><label>13.2.1</label>
<title>Static, comparing modern to classic literature</title> <disp-quote>
<p><italic>&#x201C;A classic is something that everybody wants to have read and nobody wants to read.&#x201D;</italic></p>
<attrib>Mark Twain</attrib></disp-quote><p>Readers not interested in reading the classics can skip this section. Here, we will not review the perceptron algorithm,<xref ref-type="fn" rid="fn300"><sup>300</sup></xref><fn id="fn300"><label>300</label><p>See, e.g., &#x201C;Perceptron&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Perceptron&amp;oldid=896433053">version 13:11, 10 May 2019</ext-link>, and many other references.</p></fn> but focus our attention on the historical details not found in many modern references, and connect Eq. (<xref ref-type="disp-formula" rid="eqn-488">488</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>) to the original paper by Rosenblatt (1958) [<xref ref-type="bibr" rid="ref-119">119</xref>]. The problem is that such a task is not straightforward for readers of modern literature, such as [<xref ref-type="bibr" rid="ref-78">78</xref>], for the following reasons:</p>
<list list-type="bullet">
<list-item><p>Rosenblatt&#x2019;s work in [<xref ref-type="bibr" rid="ref-119">119</xref>] was based on neuroscience, which is confusing to those without this background;</p></list-item>
<list-item><p>Unfamiliar notations and concepts for readers coming from deep-learning literature, such as [<xref ref-type="bibr" rid="ref-21">21</xref>], [<xref ref-type="bibr" rid="ref-78">78</xref>];</p></list-item>
<list-item><p>The word &#x201C;weight&#x201D; was not used at all in [<xref ref-type="bibr" rid="ref-119">119</xref>], and thus cannot be used to indirectly search for hints of equations similar to Eq. (<xref ref-type="disp-formula" rid="eqn-488">488</xref>) or Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>);</p></list-item>
<list-item><p>The word &#x201C;threshold&#x201D; was used several times, such as in the sentence &#x201C;If the algebraic sum of excitatory and inhibitory impulse intensities is equal to or greater than the threshold (<italic>&#x03B8;</italic>)&#x201D;<xref ref-type="fn" rid="fn301"><sup>301</sup></xref><fn id="fn301"><label>301</label><p>Of course, the notation <inline-formula id="ieqn-3342"><mml:math id="mml-ieqn-3342"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> (lightface) here does not designate the set of parameters, denoted by <inline-formula id="ieqn-3343"><mml:math id="mml-ieqn-3343"><mml:mrow><mml:mi mathvariant='bold-italic'>&#x03B8;</mml:mi></mml:mrow></mml:math></inline-formula> (boldface) in Eq. (<xref ref-type="disp-formula" rid="eqn-31">31</xref>).</p></fn> of the A-unit, then the A-unit fires, again on an all-or-nothing basis&#x201D;. The threshold <italic>&#x03B8;</italic> is used in Eq. (<xref ref-type="disp-formula" rid="eqn-2">2</xref>) of [<xref ref-type="bibr" rid="ref-119">119</xref>]:</p></list-item>
</list>
<p><disp-formula id="eqn-490"><label>(490)</label><mml:math id="mml-eqn-490" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>e</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mi>e</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2265;</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2717"><mml:math id="mml-ieqn-2717"><mml:mi>e</mml:mi></mml:math></inline-formula> is the number of excitatory stimulus components received by the A-unit (neuron, or association unit), <inline-formula id="ieqn-2718"><mml:math id="mml-ieqn-2718"><mml:mi>i</mml:mi></mml:math></inline-formula> the number of inhibitory stimulus components, <inline-formula id="ieqn-2719"><mml:math id="mml-ieqn-2719"><mml:msub><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:msub></mml:math></inline-formula> the number of lost excitatory components, <inline-formula id="ieqn-2720"><mml:math id="mml-ieqn-2720"><mml:msub><mml:mi>l</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> the number of lost inhibitory components, <inline-formula id="ieqn-2721"><mml:math id="mml-ieqn-2721"><mml:msub><mml:mi>g</mml:mi><mml:mi>e</mml:mi></mml:msub></mml:math></inline-formula> the number of gained excitatory components, and <inline-formula id="ieqn-2722"><mml:math id="mml-ieqn-2722"><mml:msub><mml:mi>g</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> the number of gained inhibitory components. But all these quantities are positive integers, and thus would not be the real-number weights <inline-formula id="ieqn-2723"><mml:math id="mml-ieqn-2723"><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>), of which the inputs <inline-formula id="ieqn-2724"><mml:math id="mml-ieqn-2724"><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> also have no clear equivalent in Eq. (<xref ref-type="disp-formula" rid="eqn-490">490</xref>).</p>
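<p>To make the bookkeeping in Eq. (<xref ref-type="disp-formula" rid="eqn-490">490</xref>) concrete, the following sketch evaluates the firing condition for one A-unit from the six integer counts. The function name and the numerical values are hypothetical illustrations, not data taken from [<xref ref-type="bibr" rid="ref-119">119</xref>].</p>
<code language="python">
```python
def a_unit_fires(e, i, l_e, l_i, g_e, g_i, theta):
    """Firing condition of Eq. (490): net component count against the threshold."""
    net = e - i - l_e + l_i + g_e - g_i
    return net >= theta

# Hypothetical counts (assumption): net = 5 - 2 - 1 + 0 + 1 - 1 = 2
counts = dict(e=5, i=2, l_e=1, l_i=0, g_e=1, g_i=1)

fires_at_equality = a_unit_fires(**counts, theta=2)   # net equals the threshold
fires_higher_theta = a_unit_fires(**counts, theta=3)  # net is below the threshold
print(fires_at_equality, fires_higher_theta)          # True False
```
</code>
<p>In this sketch the A-unit fires when the net count is equal to the threshold, following the &#x201C;equal to or greater than&#x201D; wording of [<xref ref-type="bibr" rid="ref-119">119</xref>], whereas Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>) gives an output of 0 at equality.</p>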
<p>As will be shown below, it was misleading to refer to [<xref ref-type="bibr" rid="ref-119">119</xref>] for equations such as Eq. (<xref ref-type="disp-formula" rid="eqn-488">488</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>), even though [<xref ref-type="bibr" rid="ref-119">119</xref>] contained the seed ideas leading to these equations upon refinement as presented in [<xref ref-type="bibr" rid="ref-120">120</xref>], which was in turn based on the book by Rosenblatt (1962) [<xref ref-type="bibr" rid="ref-2">2</xref>].</p>
<p>Instead of a direct reading of [<xref ref-type="bibr" rid="ref-119">119</xref>], we suggest reading key publications in reverse chronological order. We also use the original notations to help readers quickly identify the relevant equations in the classic literature.</p>
<p>The authors of [<xref ref-type="bibr" rid="ref-121">121</xref>] introduced a general class of machines, each known under different names, but decided to call all these machines &#x201C;perceptrons&#x201D; in honor of the pioneering work of Rosenblatt. General perceptrons were defined in [<xref ref-type="bibr" rid="ref-121">121</xref>], p. 10, as follows. Let <inline-formula id="ieqn-2725"><mml:math id="mml-ieqn-2725"><mml:msub><mml:mrow><mml:mi>&#x03C6;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> be the ith image characteristic, called an image predicate, which consists of a verb and an object, such as &#x201C;is a circle&#x201D;, &#x201C;is a convex figure&#x201D;, etc. An image predicate is also known as an image feature.<xref ref-type="fn" rid="fn302"><sup>302</sup></xref><fn id="fn302"><label>302</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 3. Another example of a feature is a piece of information about a patient for medical diagnostics. &#x201C;For many tasks, it is difficult to know which features should be extracted.&#x201D; For example, to detect cars, we can try to detect the wheels, but &#x201C;it is difficult to describe exactly what a wheel looks like in terms of pixel values&#x201D;, due to shadows, glares, objects obscuring parts of a wheel, etc.</p></fn> For example, let the ith image characteristic be whether an image &#x201C;is a circle&#x201D;; then <inline-formula id="ieqn-2726"><mml:math id="mml-ieqn-2726"><mml:msub><mml:mrow><mml:mi>&#x03C6;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2261;</mml:mo><mml:msub><mml:mi>&#x03C6;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>r</mml:mi><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
If an image <inline-formula id="ieqn-2727"><mml:math id="mml-ieqn-2727"><mml:mi>X</mml:mi></mml:math></inline-formula> is a circle, then <inline-formula id="ieqn-2728"><mml:math id="mml-ieqn-2728"><mml:msub><mml:mi>&#x03C6;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>r</mml:mi><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>; if not, <inline-formula id="ieqn-2729"><mml:math id="mml-ieqn-2729"><mml:msub><mml:mi>&#x03C6;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>r</mml:mi><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-491"><label>(491)</label><mml:math id="mml-eqn-491" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03C6;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>r</mml:mi><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mrow><mml:mtext>&#xA0;if&#xA0;</mml:mtext></mml:mrow><mml:mi>X</mml:mi><mml:mrow><mml:mtext>&#xA0;is a circle</mml:mtext></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mrow><mml:mtext>&#xA0;if&#xA0;</mml:mtext></mml:mrow><mml:mi>X</mml:mi><mml:mrow><mml:mtext>&#xA0;is not a circle</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Let <inline-formula id="ieqn-2730"><mml:math id="mml-ieqn-2730"><mml:mtext>&#x03A6;</mml:mtext></mml:math></inline-formula> be a family of simple image predicates:</p>
<p><disp-formula id="eqn-492"><label>(492)</label><mml:math id="mml-eqn-492" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>&#x03C6;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03C6;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>A <italic>general</italic> perceptron was defined as a more complex predicate, denoted by <italic>&#x03C8;</italic>, which was a <italic>weighted voting</italic> or <italic>linear combination</italic> of the simple predicates in <inline-formula id="ieqn-2732"><mml:math id="mml-ieqn-2732"><mml:mtext>&#x03A6;</mml:mtext></mml:math></inline-formula> such that<xref ref-type="fn" rid="fn303"><sup>303</sup></xref><fn id="fn303"><label>303</label><p>[<xref ref-type="bibr" rid="ref-121">121</xref>], p. 10.</p></fn></p>
<p><disp-formula id="eqn-493"><label>(493)</label><mml:math id="mml-eqn-493" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03C8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">&#x27FA;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>&#x03C6;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:msub><mml:mi>&#x03C6;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>&gt;</mml:mo><mml:mi>&#x03B8;</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-2733"><mml:math id="mml-ieqn-2733"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> being the weight associated with the ith image predicate <inline-formula id="ieqn-2734"><mml:math id="mml-ieqn-2734"><mml:msub><mml:mi>&#x03C6;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, and <italic>&#x03B8;</italic> the threshold or the negative of the bias. As such &#x201C;each predicate of &#x03A6; is supposed to provide some evidence about whether <italic>&#x03C8;</italic> is true for any figure <italic>X</italic>.&#x201D;<xref ref-type="fn" rid="fn304"><sup>304</sup></xref><fn id="fn304"><label>304</label><p>[<xref ref-type="bibr" rid="ref-121">121</xref>], p. 11.</p></fn> The expression on the right of the equivalence sign, written with the notations used here, is the general case of Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>). The authors of [<xref ref-type="bibr" rid="ref-121">121</xref>], p. 12, then defined the Rosenblatt perceptron as a special case of Eq. (<xref ref-type="disp-formula" rid="eqn-493">493</xref>) in which the image predicates in <inline-formula id="ieqn-2739"><mml:math id="mml-ieqn-2739"><mml:mtext>&#x03A6;</mml:mtext></mml:math></inline-formula> were random Boolean functions, generated by a random process according to a probability distribution.</p>
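<p>The weighted voting in Eq. (<xref ref-type="disp-formula" rid="eqn-493">493</xref>) can be illustrated with a minimal sketch. The three image predicates below, the 3-by-3 binary images, the weights, and the threshold are all hypothetical choices made only for this illustration; they are not taken from [<xref ref-type="bibr" rid="ref-121">121</xref>].</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
# Minimal sketch of a general perceptron as a weighted vote of predicates.
# Images are lists of rows with 0/1 pixels (an assumed representation).

def phi_top_left(image):
    """Hypothetical predicate: the top-left pixel is on."""
    return 1 if image[0][0] == 1 else 0

def phi_bottom_right(image):
    """Hypothetical predicate: the bottom-right pixel is on."""
    return 1 if image[-1][-1] == 1 else 0

def phi_center(image):
    """Hypothetical predicate: the center pixel is on."""
    mid_row = len(image) // 2
    mid_col = len(image[0]) // 2
    return 1 if image[mid_row][mid_col] == 1 else 0

def make_perceptron(predicates, weights, theta):
    """Return psi with psi(X) = 1 iff sum_i alpha_i * phi_i(X) > theta."""
    def psi(image):
        vote = sum(alpha * phi(image) for alpha, phi in zip(weights, predicates))
        return 1 if vote > theta else 0
    return psi

# Family Phi of simple predicates, equal weights, threshold 1.5:
# psi is true when at least two of the three predicates hold.
family = [phi_top_left, phi_bottom_right, phi_center]
psi = make_perceptron(family, weights=[1.0, 1.0, 1.0], theta=1.5)

image_a = [[1, 0, 0],
           [0, 1, 0],
           [0, 0, 0]]   # top-left and center on: two votes
image_b = [[0, 0, 0],
           [0, 1, 0],
           [0, 0, 0]]   # only center on: one vote

result_a = psi(image_a)
result_b = psi(image_b)
```
]]></code>
<p>In this sketch each predicate contributes evidence of equal weight; choosing unequal weights or a different threshold changes which combinations of predicates make the composite predicate true.</p>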
<p>The next paper to read is [<xref ref-type="bibr" rid="ref-120">120</xref>], which was based on the book by Rosenblatt (1962) [<xref ref-type="bibr" rid="ref-2">2</xref>], and from which the following equation<xref ref-type="fn" rid="fn305"><sup>305</sup></xref><fn id="fn305"><label>305</label><p>Eq. (<xref ref-type="disp-formula" rid="eqn-4">4</xref>) in [<xref ref-type="bibr" rid="ref-120">120</xref>].</p></fn> can be identified as being similar to Eq. (<xref ref-type="disp-formula" rid="eqn-493">493</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>), given here in its original notation as</p>
<p><disp-formula id="eqn-494"><label>(494)</label><mml:math id="mml-eqn-494" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>&#x03BC;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>a</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:msub><mml:mi>y</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&gt;</mml:mo><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;for&#xA0;</mml:mtext></mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2740"><mml:math id="mml-ieqn-2740"><mml:msub><mml:mi>y</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula> was the weight corresponding to the input <inline-formula id="ieqn-2741"><mml:math id="mml-ieqn-2741"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to the &#x201C;associated unit&#x201D; <inline-formula id="ieqn-2742"><mml:math id="mml-ieqn-2742"><mml:msub><mml:mi>a</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula> (neuron) from the stimulus pattern <inline-formula id="ieqn-2743"><mml:math id="mml-ieqn-2743"><mml:msub><mml:mi>S</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> (ith example in the dataset), <inline-formula id="ieqn-2744"><mml:math id="mml-ieqn-2744"><mml:msub><mml:mi>N</mml:mi><mml:mi>a</mml:mi></mml:msub></mml:math></inline-formula> the number of &#x201C;associated units&#x201D; (neurons), <inline-formula id="ieqn-2745"><mml:math id="mml-ieqn-2745"><mml:mi>n</mml:mi></mml:math></inline-formula> the number of &#x201C;stimulus patterns&#x201D; (examples in the dataset), and &#x0398; the second of two thresholds, which were fixed real non-negative numbers, and which corresponded to the negative of the bias <inline-formula id="ieqn-2747"><mml:math id="mml-ieqn-2747"><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>), or <inline-formula id="ieqn-2748"><mml:math id="mml-ieqn-2748"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Figure <xref ref-type="fig" rid="fig-132">132</xref>.</p>
<p>To discriminate between two classes, the input <inline-formula id="ieqn-2749"><mml:math id="mml-ieqn-2749"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> took the value <inline-formula id="ieqn-2750"><mml:math id="mml-ieqn-2750"><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> or <inline-formula id="ieqn-2751"><mml:math id="mml-ieqn-2751"><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, when there was excitation coming from the stimulus pattern <inline-formula id="ieqn-2752"><mml:math id="mml-ieqn-2752"><mml:msub><mml:mi>S</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> to the neuron <inline-formula id="ieqn-2753"><mml:math id="mml-ieqn-2753"><mml:msub><mml:mi>a</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula>, and the value <inline-formula id="ieqn-2754"><mml:math id="mml-ieqn-2754"><mml:mn>0</mml:mn></mml:math></inline-formula> when there was no excitation from <inline-formula id="ieqn-2755"><mml:math id="mml-ieqn-2755"><mml:msub><mml:mi>S</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> to <inline-formula id="ieqn-2756"><mml:math id="mml-ieqn-2756"><mml:msub><mml:mi>a</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula>. When the weighted voting or linear combination in Eq. (<xref ref-type="disp-formula" rid="eqn-494">494</xref>) surpassed the threshold &#x0398;, the response was correct (or yielded the value +1).</p>
<p>If the algebraic sum <inline-formula id="ieqn-2758"><mml:math id="mml-ieqn-2758"><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>&#x03BC;</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> of the connection strengths <inline-formula id="ieqn-2759"><mml:math id="mml-ieqn-2759"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> between the neuron (associated unit) <inline-formula id="ieqn-2760"><mml:math id="mml-ieqn-2760"><mml:msub><mml:mi>a</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula> and the sensory unit <inline-formula id="ieqn-2761"><mml:math id="mml-ieqn-2761"><mml:msub><mml:mi>s</mml:mi><mml:mi>&#x03C3;</mml:mi></mml:msub></mml:math></inline-formula> inside the pattern (example) <inline-formula id="ieqn-2762"><mml:math id="mml-ieqn-2762"><mml:msub><mml:mi>S</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> surpassed a threshold <italic>&#x03B8;</italic> (which was the first of two thresholds, and which does not correspond to the negative of the bias in modern networks), then the neuron <inline-formula id="ieqn-2764"><mml:math id="mml-ieqn-2764"><mml:msub><mml:mi>a</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:msub></mml:math></inline-formula> was activated:<xref ref-type="fn" rid="fn306"><sup>306</sup></xref><fn id="fn306"><label>306</label><p>First equation, unnumbered, in [<xref ref-type="bibr" rid="ref-120">120</xref>]. That this equation was unnumbered also indicated that it would not be subsequently referred to (and hence perhaps not considered as important).</p></fn></p>
<p><disp-formula id="eqn-495"><label>(495)</label><mml:math id="mml-eqn-495" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mi>&#x03BC;</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>:=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mstyle displaystyle="false" scriptlevel="0"><mml:mrow><mml:mfrac linethickness="0"><mml:mrow><mml:mi>&#x03C3;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>&#x03C3;</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:mrow></mml:munder><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2265;</mml:mo><mml:mi>&#x03B8;</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Eq. (<xref ref-type="disp-formula" rid="eqn-495">495</xref>) in [<xref ref-type="bibr" rid="ref-120">120</xref>] would correspond to Eq. (<xref ref-type="disp-formula" rid="eqn-490">490</xref>) in [<xref ref-type="bibr" rid="ref-119">119</xref>], with the connection strengths <inline-formula id="ieqn-2765"><mml:math id="mml-ieqn-2765"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> being &#x201C;random numbers having the possible values +1, -1, 0&#x201D;.</p>
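<p>The two thresholds in Eq. (<xref ref-type="disp-formula" rid="eqn-495">495</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-494">494</xref>) can be traced with a small numerical sketch. Everything concrete below is hypothetical: four sensory units, three associated units, a fixed connection table with entries in {+1, &#x2212;1, 0} standing in for the random connections, two stimulus patterns, and hand-picked weights. The sketch also assumes that the sign of a nonzero input encodes the class of the stimulus pattern, which the text above does not state explicitly.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
# Sketch of the two-threshold structure: Eq. (495) activates associated
# units, Eq. (494) tests the weighted vote of the resulting inputs.

def association_sums(pattern, connections):
    """Eq. (495), left side: for each associated unit a_mu, sum C[sigma][mu]
    over the sensory units sigma that are active in the pattern S_i."""
    n_assoc = len(connections[0])
    return [sum(connections[sigma][mu] for sigma in pattern)
            for mu in range(n_assoc)]

def association_inputs(pattern, class_sign, connections, theta):
    """Inputs b_{mu i}: class_sign (+1 or -1, an assumed class encoding)
    if a_mu is activated (sum >= theta), and 0 otherwise."""
    sums = association_sums(pattern, connections)
    return [class_sign if alpha >= theta else 0 for alpha in sums]

def satisfies_eq_494(weights, inputs, big_theta):
    """Eq. (494): sum over mu of y_mu * b_{mu i} > Theta."""
    return sum(y * b for y, b in zip(weights, inputs)) > big_theta

# Hypothetical fixed connections C[sigma][mu] with values in {+1, -1, 0}.
C = [[1, 0, -1],
     [1, 1, 0],
     [0, -1, 1],
     [-1, 1, 1]]

pattern_1 = {0, 1}   # active sensory units of S_1, assumed class +1
pattern_2 = {2, 3}   # active sensory units of S_2, assumed class -1
theta = 1            # first threshold, Eq. (495)
big_theta = 1.0      # second threshold, Eq. (494)
y = [1.0, 0.5, -2.0] # hand-picked weights, not learned

b_1 = association_inputs(pattern_1, +1, C, theta)
b_2 = association_inputs(pattern_2, -1, C, theta)
ok_1 = satisfies_eq_494(y, b_1, big_theta)
ok_2 = satisfies_eq_494(y, b_2, big_theta)
```
]]></code>
<p>Because the connections are held fixed, only the weights on the associated units would change during learning, which is consistent with the minor role played by Eq. (<xref ref-type="disp-formula" rid="eqn-495">495</xref>).</p>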
<p>That Eq. (<xref ref-type="disp-formula" rid="eqn-495">495</xref>) was not numbered in [<xref ref-type="bibr" rid="ref-120">120</xref>] indicates that it played a minor role in this paper. The reason is clear, since the author of [<xref ref-type="bibr" rid="ref-120">120</xref>] stated<xref ref-type="fn" rid="fn307"><sup>307</sup></xref><fn id="fn307"><label>307</label><p>See above Eq. (<xref ref-type="disp-formula" rid="eqn-1">1</xref>) in [<xref ref-type="bibr" rid="ref-120">120</xref>].</p></fn> that &#x201C;the connections <inline-formula id="ieqn-2766"><mml:math id="mml-ieqn-2766"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> do not change&#x201D;, and thus &#x201C;we may disregard the sensory retina altogether&#x201D;, i.e., Eq. (<xref ref-type="disp-formula" rid="eqn-495">495</xref>).</p>
<p>Moreover, the very first sentence in [<xref ref-type="bibr" rid="ref-120">120</xref>] was &#x201C;The perceptron is a self-organizing and adaptive system proposed by Rosenblatt&#x201D;, and the book by [<xref ref-type="bibr" rid="ref-2">2</xref>] was immediately cited as Ref. 1, whereas only much later in the fourth page of [<xref ref-type="bibr" rid="ref-120">120</xref>] did the author write &#x201C;With the Perceptron, Rosenblatt offered for the first time a model...&#x201D;, and cited Rosenblatt&#x2019;s 1958 report first as Ref. 34, followed by the paper [<xref ref-type="bibr" rid="ref-119">119</xref>] as Ref. 35.</p>
<p>In a major work on AI dedicated to Rosenblatt after his death in a boat accident, the authors of [<xref ref-type="bibr" rid="ref-121">121</xref>], p. xi, in the Prologue of their book, referred to Rosenblatt&#x2019;s (1962) book [<xref ref-type="bibr" rid="ref-2">2</xref>] and not Rosenblatt&#x2019;s (1958) paper [<xref ref-type="bibr" rid="ref-119">119</xref>]:</p>
<disp-quote><p>&#x201C;<bold>The 1960s: Connectionists and Symbolists</bold></p><p>Interest in connectionist networks revived dramatically in 1962 with the publication of Frank Rosenblatt&#x2019;s book <italic>Principles of Neurodynamics</italic> in which he defined the machines he named perceptrons and proved many theories about them.&#x201D;</p>
</disp-quote><p>In fact, Rosenblatt&#x2019;s (1958) paper [<xref ref-type="bibr" rid="ref-119">119</xref>] was never referred to in [<xref ref-type="bibr" rid="ref-121">121</xref>], except for a brief mention of the influence of &#x201C;Rosenblatt&#x2019;s [1958]&#x201D; work on p. 19, without the full bibliographic details. The authors of [<xref ref-type="bibr" rid="ref-121">121</xref>] wrote:</p>
<disp-quote><p>&#x201C;However, it is not our goal here to evaluate these theories [to model brain functioning], but only to sketch a picture of the intellectual stage that was set for the perceptron concept. In this setting, Rosenblatt&#x2019;s [1958] schemes quickly took root, and soon there were perhaps as many as a hundred groups, large and small, experimenting with the model either as a &#x2018;learning machine&#x2019; or in the guise of &#x2018;adaptive&#x2019; or &#x2018;self-organizing&#x2019; networks or &#x2018;automatic control&#x2019; systems.&#x201D;</p>
</disp-quote><p>So why was [<xref ref-type="bibr" rid="ref-119">119</xref>] often referred to for Eq. (<xref ref-type="disp-formula" rid="eqn-488">488</xref>) or Eq. (<xref ref-type="disp-formula" rid="eqn-489">489</xref>), instead of [<xref ref-type="bibr" rid="ref-120">120</xref>] or [<xref ref-type="bibr" rid="ref-2">2</xref>],<xref ref-type="fn" rid="fn308"><sup>308</sup></xref><fn id="fn308"><label>308</label><p>A search on the Web of Science on 2019.07.04 indicated that [<xref ref-type="bibr" rid="ref-119">119</xref>] received 2,346 citations, whereas [<xref ref-type="bibr" rid="ref-120">120</xref>] received 168 citations. A search on Google Books on the same day indicated that [<xref ref-type="bibr" rid="ref-2">2</xref>] received 21 citations.</p></fn> which would be much better references for these equations? One reason could be that citing [<xref ref-type="bibr" rid="ref-120">120</xref>] would not do justice to [<xref ref-type="bibr" rid="ref-119">119</xref>], which contained the germ of the idea, even though not as refined as four years later in [<xref ref-type="bibr" rid="ref-120">120</xref>] and [<xref ref-type="bibr" rid="ref-2">2</xref>]. Another reason could be a herd effect, in which authors follow others who referred to [<xref ref-type="bibr" rid="ref-119">119</xref>] without actually reading the paper, or without comparing it to [<xref ref-type="bibr" rid="ref-120">120</xref>] or [<xref ref-type="bibr" rid="ref-2">2</xref>]. The best approach would be to refer to both [<xref ref-type="bibr" rid="ref-119">119</xref>] and [<xref ref-type="bibr" rid="ref-120">120</xref>], as papers like these would be more accessible than books like [<xref ref-type="bibr" rid="ref-2">2</xref>].</p> 
<statement id="st13_1"><title><xref ref-type="statement" rid="st13_1">Remark 13.1</xref>.</title>
<p><italic>The hype on the Rosenblatt perceptron</italic> Mark I computer described in the 1958 New York Times article shown in Figure <xref ref-type="fig" rid="fig-133">133</xref>, together with the criticism of the Rosenblatt perceptron in [<xref ref-type="bibr" rid="ref-121">121</xref>] for failing to represent the XOR function, led to great early disappointment about the possibilities of AI when inflated expectations for such a device did not pan out, and contributed to the first AI winter, which lasted until the 1980s, when interest resurged owing to the development of backpropagation and its application in psychology as reported in [<xref ref-type="bibr" rid="ref-22">22</xref>]. But some sixty years after the Mark I computer, AI still cannot think like human babies: &#x201C;Understanding babies and young children may be one key to ensuring that the current &#x2018;AI spring&#x2019; continues&#x2014;despite some chilly autumnal winds in the air&#x201D; [<xref ref-type="bibr" rid="ref-321">321</xref>].&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p>
</statement> </sec>
<sec id="s13_2_2"><label>13.2.2</label>
<title>Dynamic, time dependence, Volterra series</title>
<p>For time-dependent input <inline-formula id="ieqn-2767"><mml:math id="mml-ieqn-2767"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the <italic>continuous temporal summation</italic>, mentioned in [<xref ref-type="bibr" rid="ref-2">2</xref>], p. 33, is present in all terms other than the constant term in the Volterra series<xref ref-type="fn" rid="fn309"><sup>309</sup></xref><fn id="fn309"><label>309</label><p>[<xref ref-type="bibr" rid="ref-19">19</xref>], p. 46. &#x201C;The Volterra series is a model for non-linear behavior similar to the Taylor series. It differs from the Taylor series in its ability to capture &#x2019;memory&#x2019; effects. The Taylor series can be used for approximating the response of a nonlinear system to a given input if the output of this system depends strictly on the input at that particular time. In the Volterra series the output of the nonlinear system depends on the input to the system at all other times. This provides the ability to capture the &#x2019;memory&#x2019; effect of devices like capacitors and inductors. It has been applied in the fields of medicine (biomedical engineering) and biology, especially <italic>neuroscience</italic>. 
In mathematics, a Volterra series denotes a functional expansion of a dynamic, nonlinear, time-invariant functional,&#x201D; in &#x201C;Volterra series&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Volterra_series&amp;oldid=854737438">version 12:49, 13 August 2018</ext-link>.</p></fn> of the estimated output <inline-formula id="ieqn-2768"><mml:math id="mml-ieqn-2768"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as a result of the input <inline-formula id="ieqn-2769"><mml:math id="mml-ieqn-2769"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula></p>
<p><disp-formula id="eqn-496"><label>(496)</label><mml:math id="mml-eqn-496" display="block"><mml:mtable columnalign="left left left left" rowspacing="0.9em 0.9em 0.4em" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mo>+</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munder><mml:mo>&#x222B;</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>&#x222B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>.</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mstyle displaystyle="true" 
scriptlevel="0"><mml:mo>+</mml:mo><mml:mo>&#x222B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mo>&#x222C;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mphantom><mml:mo>=</mml:mo></mml:mphantom></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mphantom><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mphantom></mml:mrow></mml:mtd><mml:mtd><mml:mstyle displaystyle="true" 
scriptlevel="0"><mml:mo>+</mml:mo><mml:mo>&#x222D;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mi>d</mml:mi><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mo>&#x22EF;</mml:mo></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2770"><mml:math id="mml-ieqn-2770"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the kernel of the nth order, with all integrals over the time lags going from <inline-formula id="ieqn-2771"><mml:math id="mml-ieqn-2771"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> (the current time) to <inline-formula id="ieqn-2772"><mml:math id="mml-ieqn-2772"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula> (the infinitely distant past), for <inline-formula id="ieqn-2773"><mml:math id="mml-ieqn-2773"><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>. The linear-order approximation of the Volterra series in Eq. (<xref ref-type="disp-formula" rid="eqn-496">496</xref>) is then</p>
<p><disp-formula id="eqn-497"><label>(497)</label><mml:math id="mml-eqn-497" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2248;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mo>+</mml:mo><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mo>+</mml:mo><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi 
mathvariant="normal">&#x221E;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>&#x03C4;</mml:mi></mml:mstyle></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with the continuous linear combination (weighted sum) appearing in the second term. The convolution integral in Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>) is the basis for convolutional networks for highly effective and efficient image recognition, inspired by the mammalian visual system. A review of convolutional networks is outside the scope here, despite their being the &#x201C;greatest success story of biologically inspired artificial intelligence&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 353.</p>
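<p>The equality of the two integrals in Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>), one written in terms of the time lag and the other in terms of the time itself, can be checked numerically. The sketch below is only an illustration under stated assumptions: the integrals are replaced by rectangle-rule sums with a hypothetical step size, the input is taken to be zero before time zero so that the sums are finite, and the decaying kernel with a time constant of 0.1 is a made-up example.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import math

def linear_volterra_lag(x, kernel, k0, dt, m):
    """Lag form: z(t_m) ~ k0 + sum_k kernel(k*dt) * x[m - k] * dt,
    with lag index k running from 0 (now) to m (start of the record)."""
    return k0 + dt * sum(kernel(k * dt) * x[m - k] for k in range(m + 1))

def linear_volterra_time(x, kernel, k0, dt, m):
    """Time form: z(t_m) ~ k0 + sum_j kernel((m - j)*dt) * x[j] * dt,
    with time index j running from the start of the record to now."""
    return k0 + dt * sum(kernel((m - j) * dt) * x[j] for j in range(m + 1))

def decaying_kernel(lag):
    """Hypothetical fading-memory kernel with unit area and time constant 0.1."""
    time_constant = 0.1
    return math.exp(-lag / time_constant) / time_constant

dt = 0.001          # assumed step size
n_samples = 3000    # record length (input assumed zero before t = 0)
k0 = 0.5            # constant term, playing the role of a bias
m = n_samples - 1   # index of the current time

# A smooth test input and a constant test input.
x_sine = [math.sin(j * dt) for j in range(n_samples)]
x_const = [1.0] * n_samples

z_lag = linear_volterra_lag(x_sine, decaying_kernel, k0, dt, m)
z_time = linear_volterra_time(x_sine, decaying_kernel, k0, dt, m)
z_const = linear_volterra_lag(x_const, decaying_kernel, k0, dt, m)
```
]]></code>
<p>The two forms agree up to floating-point rounding because they sum the same products in opposite order. For the constant unit input, the result is close to the constant term plus the area under the kernel, which shows the second term acting as a weighted sum over the input history.</p>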
<p>For biological neuron models, both the input <inline-formula id="ieqn-2774"><mml:math id="mml-ieqn-2774"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and the continuous weighted sum <inline-formula id="ieqn-2775"><mml:math id="mml-ieqn-2775"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> can be either currents, with nA (nanoampere) as unit, or firing rates (frequency), with Hz (hertz) as unit.</p>
<p>Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>) is the continuous temporal summation, the counterpart of the discrete spatial summation in Eq. (<xref ref-type="disp-formula" rid="eqn-26">26</xref>), with the constant term <inline-formula id="ieqn-2776"><mml:math id="mml-ieqn-2776"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> playing a role similar to that of the bias <inline-formula id="ieqn-2777"><mml:math id="mml-ieqn-2777"><mml:msubsup><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. The linear kernel <inline-formula id="ieqn-2778"><mml:math id="mml-ieqn-2778"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (also called the Wiener kernel,<xref ref-type="fn" rid="fn310"><sup>310</sup></xref><fn id="fn310"><label>310</label><p>Since the first two terms in the Volterra series coincide with the first two terms in the Wiener series; see [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 46.</p></fn> or synaptic kernel in brain modeling)<xref ref-type="fn" rid="fn311"><sup>311</sup></xref><fn id="fn311"><label>311</label><p>See [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 234.</p></fn> is the weight on the input <inline-formula id="ieqn-2779"><mml:math id="mml-ieqn-2779"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with the lag <inline-formula id="ieqn-2780"><mml:math id="mml-ieqn-2780"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> spanning the history of the input from <inline-formula id="ieqn-2781"><mml:math id="mml-ieqn-2781"><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:math></inline-formula> to the current time <inline-formula id="ieqn-2782"><mml:math id="mml-ieqn-2782"><mml:mi>t</mml:mi></mml:math></inline-formula>. In other words, the whole history of the input <inline-formula id="ieqn-2783"><mml:math id="mml-ieqn-2783"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> prior to the current time has an influence on the output <inline-formula id="ieqn-2784"><mml:math id="mml-ieqn-2784"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, typically with smaller weights for more distant inputs (fading memory). For this reason, the synaptic kernel used in the biological neuron firing-rate models is often chosen to have an exponential decay of the form:<xref ref-type="fn" rid="fn312"><sup>312</sup></xref><fn id="fn312"><label>312</label><p>See [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 234, below Eq. (7.3).</p></fn></p>
<p><disp-formula id="eqn-498"><label>(498)</label><mml:math id="mml-eqn-498" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>:=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mfrac><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2785"><mml:math id="mml-ieqn-2785"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> is the synaptic time constant such that the smaller <inline-formula id="ieqn-2786"><mml:math id="mml-ieqn-2786"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> is, the less memory of past input values, and</p>
<p><disp-formula id="eqn-499"><label>(499)</label><mml:math id="mml-eqn-499" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>i.e., the continuous weighted sum <inline-formula id="ieqn-2787"><mml:math id="mml-ieqn-2787"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> would correspond to the instantaneous <inline-formula id="ieqn-2788"><mml:math id="mml-ieqn-2788"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (without memory of past input values) as the synaptic time constant <inline-formula id="ieqn-2789"><mml:math id="mml-ieqn-2789"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> goes to zero (no memory).</p> 
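<p>The fading memory of the exponential synaptic kernel in Eq. (<xref ref-type="disp-formula" rid="eqn-498">498</xref>) and the memoryless limit in Eq. (<xref ref-type="disp-formula" rid="eqn-499">499</xref>) can be checked numerically. The following is a minimal Python sketch, not taken from the cited references: only the linear term of the series is kept, and the function names, the test input sin(t), the truncation of the input history at 40 time constants, and the midpoint quadrature are all assumptions made for illustration.</p>
<p><code language="python"><![CDATA[
```python
import math

def synaptic_kernel(lag, tau_s):
    """Exponential synaptic kernel of Eq. (498), evaluated at a non-negative time lag."""
    return math.exp(-lag / tau_s) / tau_s

def weighted_sum(x, t, tau_s, k0=0.0, n_steps=20000, horizon=40.0):
    """Approximate z(t) = k0 + integral over lag >= 0 of K_s(lag) * x(t - lag).

    The infinite history is truncated at `horizon` time constants (assumption),
    and the integral is evaluated with the midpoint rule.
    """
    d_lag = horizon * tau_s / n_steps
    total = 0.0
    for k in range(n_steps):
        lag = (k + 0.5) * d_lag
        total += synaptic_kernel(lag, tau_s) * x(t - lag) * d_lag
    return k0 + total

x = math.sin      # hypothetical input signal
t_now = 1.0
kernel_area = weighted_sum(lambda s: 1.0, t_now, tau_s=0.5)  # area under the kernel
z_long_memory = weighted_sum(x, t_now, tau_s=1.0)            # strongly smoothed input
z_short_memory = weighted_sum(x, t_now, tau_s=0.001)         # nearly instantaneous input
print(kernel_area, z_long_memory, z_short_memory, x(t_now))
```
]]></code></p>
<p>In this sketch the kernel area stays close to one for any time constant, which is why shrinking the synaptic time constant concentrates the weight near zero lag and drives the weighted sum toward the instantaneous input.</p>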
<statement id="st13_2"><title><xref ref-type="statement" rid="st13_2">Remark 13.2</xref>.</title>
<p>The discrete counterpart of the linear part of the Volterra series in Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>) can be found in the exponential-smoothing time series in Eq. (<xref ref-type="disp-formula" rid="eqn-212">212</xref>), with the kernel <inline-formula id="ieqn-2790"><mml:math id="mml-ieqn-2790"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> being the exponential function <inline-formula id="ieqn-2791"><mml:math id="mml-ieqn-2791"><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>; see Section <xref ref-type="sec" rid="s6_5_3">6.5.3</xref> on exponential smoothing in forecasting. The similarity is even closer when the synaptic kernel <inline-formula id="ieqn-2792"><mml:math id="mml-ieqn-2792"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is of exponential form as in Eq. (<xref ref-type="disp-formula" rid="eqn-498">498</xref>).&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
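<p>The correspondence noted in the remark above can be illustrated with a short Python sketch. It assumes the common recursive form of exponential smoothing, which may differ in notation from Eq. (<xref ref-type="disp-formula" rid="eqn-212">212</xref>), and checks that the recursion equals a discrete kernel sum with geometrically decaying weights. The names, the sampling step, and the test signal are hypothetical.</p>
<p><code language="python"><![CDATA[
```python
import math

def smooth_recursive(xs, beta):
    """Exponential smoothing s_k = beta * s_{k-1} + (1 - beta) * x_k, starting from s = 0."""
    s, out = 0.0, []
    for x_k in xs:
        s = beta * s + (1.0 - beta) * x_k
        out.append(s)
    return out

def smooth_kernel_sum(xs, beta):
    """Same quantity written as a kernel sum with weights (1 - beta) * beta**(k - i)."""
    return [sum((1.0 - beta) * beta ** (k - i) * xs[i] for i in range(k + 1))
            for k in range(len(xs))]

tau_s, dt = 0.5, 0.1            # hypothetical synaptic time constant and sampling step
beta = math.exp(-dt / tau_s)    # ratio of successive samples of the kernel in Eq. (498)
xs = [math.sin(0.3 * k) for k in range(50)]
recursive = smooth_recursive(xs, beta)
kernel_sum = smooth_kernel_sum(xs, beta)
max_gap = max(abs(a - b) for a, b in zip(recursive, kernel_sum))
print(beta, max_gap)
```
]]></code></p>
<p>Sampling the exponential kernel of Eq. (<xref ref-type="disp-formula" rid="eqn-498">498</xref>) at equally spaced lags gives a geometric sequence with ratio exp(-dt/tau_s), which plays the role of the smoothing parameter in this sketch.</p>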
<p>In firing-rate models of the brain (see Figure <xref ref-type="fig" rid="fig-27">27</xref>), the function <inline-formula id="ieqn-2793"><mml:math id="mml-ieqn-2793"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents an input firing rate at a synapse, <inline-formula id="ieqn-2794"><mml:math id="mml-ieqn-2794"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is called the background firing rate, and the weighted sum <inline-formula id="ieqn-2795"><mml:math id="mml-ieqn-2795"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> has the dimension of firing rate (Hz).</p>
<p>For a neuron with <inline-formula id="ieqn-2796"><mml:math id="mml-ieqn-2796"><mml:mi>n</mml:mi></mml:math></inline-formula> pre-synaptic inputs <inline-formula id="ieqn-2797"><mml:math id="mml-ieqn-2797"><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> (either currents or firing rates) as depicted in Figure <xref ref-type="fig" rid="fig-131">131</xref> of a biological neuron, the total input <inline-formula id="ieqn-2798"><mml:math id="mml-ieqn-2798"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (current or firing rate, respectively) going into the soma (cell body, Figure <xref ref-type="fig" rid="fig-131">131</xref>), called total somatic input, is a discrete weighted sum of all post-synaptic continuous weighted sums <inline-formula id="ieqn-2799"><mml:math id="mml-ieqn-2799"><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> expressed in Eq. (<xref ref-type="disp-formula" rid="eqn-497">497</xref>), assuming the same synaptic kernel <inline-formula id="ieqn-2800"><mml:math id="mml-ieqn-2800"><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> at all synapses:</p>
<p><disp-formula id="eqn-500"><label>(500)</label><mml:math id="mml-eqn-500" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mover><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mn>0</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:munderover><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>&#x1D4A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>&#x03C4;</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-2801"><mml:math id="mml-ieqn-2801"><mml:mover accent='true'><mml:mrow><mml:msub><mml:mi>&#x1D4A6;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> being the constant bias,<xref ref-type="fn" rid="fn313"><sup>313</sup></xref><fn id="fn313"><label>313</label><p>The negative of the bias <inline-formula id="ieqn-3344"><mml:math id="mml-ieqn-3344"><mml:mrow><mml:mover accent='true'><mml:mrow><mml:msub><mml:mi>&#x1D4A6;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is the threshold. The constant bias <inline-formula id="ieqn-3345"><mml:math id="mml-ieqn-3345"><mml:mrow><mml:mover accent='true'><mml:mrow><mml:msub><mml:mi>&#x1D4A6;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is called the background firing rate when the inputs <inline-formula id="ieqn-3346"><mml:math id="mml-ieqn-3346"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> are firing rates.</p></fn> and <inline-formula id="ieqn-2802"><mml:math id="mml-ieqn-2802"><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> the synaptic weight associated with the synapse <inline-formula id="ieqn-2803"><mml:math id="mml-ieqn-2803"><mml:mi>i</mml:mi></mml:math></inline-formula>.</p>
<p>Substituting the synaptic kernel Eq. (<xref ref-type="disp-formula" rid="eqn-498">498</xref>) into Eq. (<xref ref-type="disp-formula" rid="eqn-500">500</xref>) for the total somatic input <inline-formula id="ieqn-2804"><mml:math id="mml-ieqn-2804"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and differentiating with respect to time<xref ref-type="fn" rid="fn314"><sup>314</sup></xref><fn id="fn314"><label>314</label><p>In general, if <inline-formula id="ieqn-3347"><mml:math id="mml-ieqn-3347"><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>x</mml:mi></mml:math></inline-formula>, then <inline-formula id="ieqn-3348"><mml:math id="mml-ieqn-3348"><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mover><mml:mi>B</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mover><mml:mi>A</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mi>d</mml:mi><mml:mi>x</mml:mi></mml:math></inline-formula>.</p></fn> yields the following ordinary differential equation:</p>
<p><disp-formula id="eqn-501"><label>(501)</label><mml:math id="mml-eqn-501" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi><mml:mo>+</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
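<p>The step from Eq. (<xref ref-type="disp-formula" rid="eqn-500">500</xref>) to Eq. (<xref ref-type="disp-formula" rid="eqn-501">501</xref>) can be spelled out as follows, using the Leibniz rule for differentiating an integral whose upper limit depends on time. This is a sketch written in LaTeX under one stated assumption: the constant bias in Eq. (<xref ref-type="disp-formula" rid="eqn-500">500</xref>) is taken to be zero, in which case Eq. (<xref ref-type="disp-formula" rid="eqn-501">501</xref>) holds exactly as written; a nonzero constant bias would simply be added to the right-hand side.</p>
<p><code language="latex"><![CDATA[
```latex
\begin{align}
z(t) &= \sum_{i=1}^{n} w_i \int_{-\infty}^{t}
        \frac{1}{\tau_s} \exp\!\left(-\frac{t-\tau}{\tau_s}\right) x_i(\tau)\, d\tau , \\
\frac{dz}{dt} &= \sum_{i=1}^{n} w_i \left[ \mathcal{K}_s(0)\, x_i(t)
        + \int_{-\infty}^{t} \frac{\partial \mathcal{K}_s(t-\tau)}{\partial t}\,
          x_i(\tau)\, d\tau \right] \\
     &= \sum_{i=1}^{n} w_i \left[ \frac{1}{\tau_s}\, x_i(t)
        - \frac{1}{\tau_s} \int_{-\infty}^{t} \mathcal{K}_s(t-\tau)\,
          x_i(\tau)\, d\tau \right] \\
     &= \frac{1}{\tau_s} \left[ \sum_{i=1}^{n} w_i\, x_i(t) - z(t) \right] .
\end{align}
```
]]></code></p>
<p>The lower limit contributes nothing because it does not depend on time, and the exponential kernel reproduces itself under differentiation up to the factor -1/tau_s, which is what turns the integral equation into a first-order ODE.</p>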
 
<statement id="st13_3"><title><xref ref-type="statement" rid="st13_3">Remark 13.3</xref>.</title>
<p>The second term on the right-hand side of Eq. (<xref ref-type="disp-formula" rid="eqn-501">501</xref>), with time-independent inputs <inline-formula id="ieqn-2805"><mml:math id="mml-ieqn-2805"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, is the steady state of the total somatic input <inline-formula id="ieqn-2806"><mml:math id="mml-ieqn-2806"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, as seen from the solution of Eq. (<xref ref-type="disp-formula" rid="eqn-501">501</xref>):</p>
<p><disp-formula id="eqn-502"><label>(502)</label><mml:math id="mml-eqn-502" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-503"><label>(503)</label><mml:math id="mml-eqn-503" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>:=</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo>:=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-504"><label>(504)</label><mml:math id="mml-eqn-504" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mrow><mml:mtext>&#xA0;as&#xA0;</mml:mtext></mml:mrow><mml:mi>t</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>As a result, the subscript &#x221E; is often used to denote the steady-state solution, such as <inline-formula id="ieqn-2808"><mml:math id="mml-ieqn-2808"><mml:msub><mml:mi>R</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub></mml:math></inline-formula> in the model of neocortical neurons Eq. (<xref ref-type="disp-formula" rid="eqn-509">509</xref>) [<xref ref-type="bibr" rid="ref-118">118</xref>].&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement><p>For constant total somatic input <inline-formula id="ieqn-2809"><mml:math id="mml-ieqn-2809"><mml:mi>z</mml:mi></mml:math></inline-formula>, the output firing rate <inline-formula id="ieqn-2810"><mml:math id="mml-ieqn-2810"><mml:mi>y</mml:mi></mml:math></inline-formula> is given by an activation function <inline-formula id="ieqn-2811"><mml:math id="mml-ieqn-2811"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (e.g., scaled ReLU in Figure <xref ref-type="fig" rid="fig-25">25</xref> and Figure <xref ref-type="fig" rid="fig-26">26</xref>) through the relation<xref ref-type="fn" rid="fn315"><sup>315</sup></xref><fn id="fn315"><label>315</label><p>See [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 234, in original notation as <inline-formula id="ieqn-3349"><mml:math id="mml-ieqn-3349"><mml:mi>v</mml:mi><mml:mo>=</mml:mo><mml:mi>F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-3350"><mml:math id="mml-ieqn-3350"><mml:mi>v</mml:mi></mml:math></inline-formula> being the output firing rate, <inline-formula id="ieqn-3351"><mml:math id="mml-ieqn-3351"><mml:mi>F</mml:mi></mml:math></inline-formula> an activation function, and <inline-formula id="ieqn-3352"><mml:math id="mml-ieqn-3352"><mml:msub><mml:mi>I</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> the total synaptic current.</p></fn></p>
<p><disp-formula id="eqn-505"><label>(505)</label><mml:math id="mml-eqn-505" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:mi>z</mml:mi><mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo></mml:msub></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2812"><mml:math id="mml-ieqn-2812"><mml:mi>c</mml:mi></mml:math></inline-formula> is a scaling constant chosen to match the slope of the firing rate (F) vs. input current (I) relation, called the FI curve, obtained from experiments; see Figure <xref ref-type="fig" rid="fig-27">27</xref> and Figure <xref ref-type="fig" rid="fig-28">28</xref>. The total somatic input <inline-formula id="ieqn-2813"><mml:math id="mml-ieqn-2813"><mml:mi>z</mml:mi></mml:math></inline-formula> can be thought of as being converted from current (nA) to frequency (Hz) by multiplying with the converting constant <inline-formula id="ieqn-2814"><mml:math id="mml-ieqn-2814"><mml:mi>c</mml:mi></mml:math></inline-formula>.<xref ref-type="fn" rid="fn316"><sup>316</sup></xref><fn id="fn316"><label>316</label><p>See [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 234, subsection &#x201C;The Firing Rate&#x201D;.</p></fn></p>
<p>At this stage, there are two possible firing-rate models. The first firing-rate model consists of (1) Eq. (<xref ref-type="disp-formula" rid="eqn-501">501</xref>), the ODE for the total somatic input firing rate <inline-formula id="ieqn-2815"><mml:math id="mml-ieqn-2815"><mml:mi>z</mml:mi></mml:math></inline-formula>, followed by (2) the &#x201C;static&#x201D; relation between output firing-rate <inline-formula id="ieqn-2816"><mml:math id="mml-ieqn-2816"><mml:mi>y</mml:mi></mml:math></inline-formula> and constant input firing rate <inline-formula id="ieqn-2817"><mml:math id="mml-ieqn-2817"><mml:mi>z</mml:mi></mml:math></inline-formula>, expressed in Eq. (<xref ref-type="disp-formula" rid="eqn-505">505</xref>), but now used for time-dependent total somatic input firing rate <inline-formula id="ieqn-2818"><mml:math id="mml-ieqn-2818"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p> 
<statement id="st13_4"><title><xref ref-type="statement" rid="st13_4">Remark 13.4</xref>.</title>
<p>The steady-state output <inline-formula id="ieqn-2819"><mml:math id="mml-ieqn-2819"><mml:msub><mml:mi>y</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub></mml:math></inline-formula> in the first firing-rate model described in Eq. (<xref ref-type="disp-formula" rid="eqn-500">500</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-505">505</xref>) in the case of constant inputs <inline-formula id="ieqn-2820"><mml:math id="mml-ieqn-2820"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is therefore</p>
<p><disp-formula id="eqn-506"><label>(506)</label><mml:math id="mml-eqn-506" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>y</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2821"><mml:math id="mml-ieqn-2821"><mml:msub><mml:mi>z</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub></mml:math></inline-formula> is given by Eq. (<xref ref-type="disp-formula" rid="eqn-503">503</xref>).&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
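<p>The first firing-rate model can be sketched in a few lines of Python for the case of constant inputs. The sketch below integrates Eq. (<xref ref-type="disp-formula" rid="eqn-501">501</xref>) by forward Euler, applies the scaled ReLU of Eq. (<xref ref-type="disp-formula" rid="eqn-505">505</xref>), and compares the result with the closed-form solution in Eq. (<xref ref-type="disp-formula" rid="eqn-502">502</xref>) and the steady state in Eq. (<xref ref-type="disp-formula" rid="eqn-506">506</xref>). The function names, the numerical values, and the choice of forward Euler are assumptions for illustration only, and the time constant in Eq. (<xref ref-type="disp-formula" rid="eqn-502">502</xref>) is taken to be the synaptic time constant of Eq. (<xref ref-type="disp-formula" rid="eqn-501">501</xref>).</p>
<p><code language="python"><![CDATA[
```python
import math

def scaled_relu(z, c=1.0):
    """Activation of Eq. (505): y = c * max(0, z)."""
    return c * max(0.0, z)

def somatic_input_closed_form(t, z0, z_inf, tau_s):
    """Solution of Eq. (501) for constant inputs, in the form of Eq. (502)."""
    return (z0 - z_inf) * math.exp(-t / tau_s) + z_inf

def simulate_first_model(weights, inputs, z0, tau_s, c, dt, n_steps):
    """Forward-Euler integration of Eq. (501), followed by the static relation Eq. (505)."""
    z_inf = sum(w * x for w, x in zip(weights, inputs))   # steady state, Eq. (503)
    z, history = z0, []
    for _ in range(n_steps):
        z += dt / tau_s * (-z + z_inf)
        history.append((z, scaled_relu(z, c)))
    return z_inf, history

# Hypothetical numbers chosen only for illustration.
weights, inputs = [0.5, -0.2, 1.0], [10.0, 5.0, 2.0]
tau_s, c, dt, n_steps = 0.01, 2.0, 1e-5, 10000
z_inf, history = simulate_first_model(weights, inputs, 0.0, tau_s, c, dt, n_steps)
z_end, y_end = history[-1]
z_exact = somatic_input_closed_form(n_steps * dt, 0.0, z_inf, tau_s)
print(z_inf, z_end, z_exact, y_end, scaled_relu(z_inf, c))
```
]]></code></p>
<p>After ten synaptic time constants the integrated somatic input in this sketch is already very close to its steady state, so the output is close to the activation function evaluated at the steady-state input.</p>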
<p>The second firing-rate model consists of using Eq. (<xref ref-type="disp-formula" rid="eqn-501">501</xref>) for the total somatic input firing rate <inline-formula id="ieqn-2822"><mml:math id="mml-ieqn-2822"><mml:mi>z</mml:mi></mml:math></inline-formula>, which is then used as input for the following ODE for the output firing-rate <inline-formula id="ieqn-2823"><mml:math id="mml-ieqn-2823"><mml:mi>y</mml:mi></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-507"><label>(507)</label><mml:math id="mml-eqn-507" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:mo>+</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the activation function <inline-formula id="ieqn-2824"><mml:math id="mml-ieqn-2824"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is applied on the time-dependent total somatic input firing rate <inline-formula id="ieqn-2825"><mml:math id="mml-ieqn-2825"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, but with a different time constant <inline-formula id="ieqn-2826"><mml:math id="mml-ieqn-2826"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>r</mml:mi></mml:msub></mml:math></inline-formula>, which describes how fast the output firing rate <inline-formula id="ieqn-2827"><mml:math id="mml-ieqn-2827"><mml:mi>y</mml:mi></mml:math></inline-formula> approaches steady state for <italic>constant</italic> input <inline-formula id="ieqn-2828"><mml:math id="mml-ieqn-2828"><mml:mi>z</mml:mi></mml:math></inline-formula>.</p>
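<p>The second firing-rate model chains two first-order ODEs, which the following Python sketch integrates together by forward Euler: Eq. (<xref ref-type="disp-formula" rid="eqn-501">501</xref>) for the total somatic input and Eq. (<xref ref-type="disp-formula" rid="eqn-507">507</xref>) for the output firing rate. The weighted input sum is passed in directly as a function of time, and all names and numerical values are hypothetical.</p>
<p><code language="python"><![CDATA[
```python
def scaled_relu(z, c=1.0):
    """Activation of Eq. (505): y = c * max(0, z)."""
    return c * max(0.0, z)

def simulate_second_model(drive, tau_s, tau_r, c, dt, n_steps, z0=0.0, y0=0.0):
    """Forward-Euler integration of Eq. (501) for z and Eq. (507) for y.

    `drive(t)` returns the weighted input sum at time t (supplied directly, an assumption).
    """
    z, y, history = z0, y0, []
    for k in range(n_steps):
        z_new = z + dt / tau_s * (-z + drive(k * dt))
        y_new = y + dt / tau_r * (-y + scaled_relu(z, c))
        z, y = z_new, y_new
        history.append((z, y))
    return history

# Hypothetical step input: the weighted input sum jumps to 6 at t = 0 and stays there.
tau_s, tau_r, c, dt, n_steps = 0.01, 0.05, 2.0, 1e-4, 10000
history = simulate_second_model(lambda t: 6.0, tau_s, tau_r, c, dt, n_steps)
z_mid, y_mid = history[99]      # state after one synaptic time constant
z_end, y_end = history[-1]      # state after twenty output time constants
print(z_mid, y_mid, scaled_relu(z_mid, c), y_end)
```
]]></code></p>
<p>In this sketch the output lags behind the static relation applied to the current somatic input, and it reaches the same steady state as the first model; letting the output time constant go to zero in Eq. (<xref ref-type="disp-formula" rid="eqn-507">507</xref>) recovers the static relation of the first model.</p>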
<p>Eq. (<xref ref-type="disp-formula" rid="eqn-507">507</xref>) is a recurring theme in the literature on neuroscience and artificial neural networks. Below are a few papers relevant to this review, particularly on the <italic>continuous</italic> recurrent neural networks (RNNs), such as Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>), Eq. (<xref ref-type="disp-formula" rid="eqn-512">512</xref>), and Eqs. (<xref ref-type="disp-formula" rid="eqn-515">515</xref>)-(<xref ref-type="disp-formula" rid="eqn-516">516</xref>), which are the counterparts of the <italic>discrete</italic> RNNs in Section <xref ref-type="sec" rid="s7_1">7.1</xref>.</p>

<fig id="fig-134">
<label>Figure 134</label>
<caption><title><italic>Model of neocortical neurons</italic> in [<xref ref-type="bibr" rid="ref-118">118</xref>] as a simplification of the model in [<xref ref-type="bibr" rid="ref-322">322</xref>] (Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref>): A capacitor <inline-formula id="ieqn-2028"><mml:math id="mml-ieqn-2028"><mml:mi>C</mml:mi></mml:math></inline-formula> with a potential <inline-formula id="ieqn-2029"><mml:math id="mml-ieqn-2029"><mml:mi>V</mml:mi></mml:math></inline-formula> across its plates, in parallel with the equilibrium potentials <inline-formula id="ieqn-2030"><mml:math id="mml-ieqn-2030"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mtext>Na</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> (sodium) and <inline-formula id="ieqn-2031"><mml:math id="mml-ieqn-2031"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> (potassium) in opposite directions. Two variable resistors <inline-formula id="ieqn-2032"><mml:math id="mml-ieqn-2032"><mml:msubsup><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-2033"><mml:math id="mml-ieqn-2033"><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mi>R</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> are each in series with one of the two equilibrium potentials mentioned above. The capacitor <inline-formula id="ieqn-2034"><mml:math id="mml-ieqn-2034"><mml:mi>C</mml:mi></mml:math></inline-formula> is also in parallel with a current source <inline-formula id="ieqn-2035"><mml:math id="mml-ieqn-2035"><mml:mi>I</mml:mi></mml:math></inline-formula>. The notation <inline-formula id="ieqn-2036"><mml:math id="mml-ieqn-2036"><mml:mi>R</mml:mi></mml:math></inline-formula> is used here for the &#x201C;recovery variable&#x201D;, not a resistor. See Eqs. (<xref ref-type="disp-formula" rid="eqn-508">508</xref>)-(<xref ref-type="disp-formula" rid="eqn-509">509</xref>).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-134.tif"/>
</fig>
<p>The model for neocortical neurons in [<xref ref-type="bibr" rid="ref-118">118</xref>], a simplification of the model in [<xref ref-type="bibr" rid="ref-322">322</xref>], was employed in [<xref ref-type="bibr" rid="ref-116">116</xref>] as the starting point for a formulation that produced SubFigure(b) in Figure <xref ref-type="fig" rid="fig-28">28</xref>. This model consists of two coupled ODEs:<xref ref-type="fn" rid="fn317"><sup>317</sup></xref><fn id="fn317"><label>317</label><p>Eq. (<xref ref-type="disp-formula" rid="eqn-1">1</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-2">2</xref>) in [<xref ref-type="bibr" rid="ref-116">116</xref>].</p></fn></p>
<p><disp-formula id="eqn-508"><label>(508)</label><mml:math id="mml-eqn-508" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>C</mml:mi><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mtext>Na</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mi>R</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>I</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-509"><label>(509)</label><mml:math id="mml-eqn-509" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>&#x03C4;</mml:mi><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>R</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2829"><mml:math id="mml-ieqn-2829"><mml:msub><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-2830"><mml:math id="mml-ieqn-2830"><mml:msub><mml:mi>R</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> are prescribed quadratic polynomials in the potential <inline-formula id="ieqn-2831"><mml:math id="mml-ieqn-2831"><mml:mi>V</mml:mi></mml:math></inline-formula>, making the right-hand side of Eq. (<xref ref-type="disp-formula" rid="eqn-508">508</xref>) of cubic order. Eq. (<xref ref-type="disp-formula" rid="eqn-508">508</xref>) describes the change in the membrane potential <inline-formula id="ieqn-2832"><mml:math id="mml-ieqn-2832"><mml:mi>V</mml:mi></mml:math></inline-formula> due to the capacitance <inline-formula id="ieqn-2833"><mml:math id="mml-ieqn-2833"><mml:mi>C</mml:mi></mml:math></inline-formula> in parallel with other circuit elements shown in Figure <xref ref-type="fig" rid="fig-134">134</xref>, with (1) <inline-formula id="ieqn-2834"><mml:math id="mml-ieqn-2834"><mml:msub><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-2835"><mml:math id="mml-ieqn-2835"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mtext>Na</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> being the activation function and equilibrium potential for the sodium ion (Na<inline-formula id="ieqn-2836"><mml:math id="mml-ieqn-2836"><mml:msup><mml:mi></mml:mi><mml:mo>+</mml:mo></mml:msup></mml:math></inline-formula>), respectively, (2) <inline-formula 
id="ieqn-2837"><mml:math id="mml-ieqn-2837"><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2838"><mml:math id="mml-ieqn-2838"><mml:mi>R</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-2839"><mml:math id="mml-ieqn-2839"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> being the conductance, recovery variable, and equilibrium potential for the potassium ion (K<inline-formula id="ieqn-2840"><mml:math id="mml-ieqn-2840"><mml:msup><mml:mi></mml:mi><mml:mo>+</mml:mo></mml:msup></mml:math></inline-formula>), respectively, and (3) <inline-formula id="ieqn-2841"><mml:math id="mml-ieqn-2841"><mml:mi>I</mml:mi></mml:math></inline-formula> the stimulating current. Eq. (<xref ref-type="disp-formula" rid="eqn-509">509</xref>) for the recovery variable <inline-formula id="ieqn-2842"><mml:math id="mml-ieqn-2842"><mml:mi>R</mml:mi></mml:math></inline-formula> has the same form as Eq. (<xref ref-type="disp-formula" rid="eqn-507">507</xref>), with <inline-formula id="ieqn-2843"><mml:math id="mml-ieqn-2843"><mml:msub><mml:mi>R</mml:mi><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> being the steady state; see Remark <xref ref-type="statement" rid="st13_4">13.4</xref>.</p>
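<p>As a minimal sketch of how Eqs. (<xref ref-type="disp-formula" rid="eqn-508">508</xref>)-(<xref ref-type="disp-formula" rid="eqn-509">509</xref>) could be evaluated numerically, the Python fragment below codes the right-hand sides and one forward-Euler step. The function names, the dictionary layout, and all numerical values in PARAMS are placeholders assumed here for illustration only; they are not the coefficients used in the cited works.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def quadratic(coeffs, v):
    """Evaluate the quadratic polynomial c0 + c1*v + c2*v**2."""
    c0, c1, c2 = coeffs
    return c0 + c1 * v + c2 * v * v


def neuron_rhs(v, r, p):
    """Return (dV/dt, dR/dt) following Eqs. (508)-(509)."""
    m_inf = quadratic(p["m_coeffs"], v)   # sodium activation m_inf(V)
    r_inf = quadratic(p["r_coeffs"], v)   # steady state R_inf(V)
    dv = (-m_inf * (v - p["E_Na"])
          - p["g_K"] * r * (v - p["E_K"])
          + p["I"]) / p["C"]
    dr = (-r + r_inf) / p["tau"]
    return dv, dr


def neuron_euler_step(v, r, p, dt):
    """Advance (V, R) by one forward-Euler step of size dt."""
    dv, dr = neuron_rhs(v, r, p)
    return v + dt * dv, r + dt * dr


# Placeholder values (assumed, NOT from the cited papers), chosen only
# so that the sketch runs.
PARAMS = {
    "C": 1.0, "tau": 2.0, "E_Na": 0.5, "E_K": -0.9, "g_K": 1.0, "I": 0.0,
    "m_coeffs": (1.0, 0.5, 0.25),
    "r_coeffs": (0.2, 0.1, 0.05),
}
```
]]></preformat>
<p>Because the sodium term multiplies a quadratic in the potential by a linear factor, the first right-hand side is cubic in the potential, as noted above. When the recovery variable equals its steady state, its rate of change vanishes.</p>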
<p>To create a continuous recurrent neural network described by ODEs in Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>), the input <inline-formula id="ieqn-2844"><mml:math id="mml-ieqn-2844"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-500">500</xref>) is replaced by the output <inline-formula id="ieqn-2845"><mml:math id="mml-ieqn-2845"><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (i.e., a feedback), and the bias <inline-formula id="ieqn-2846"><mml:math id="mml-ieqn-2846"><mml:mover accent='true'><mml:mrow><mml:msub><mml:mi>&#x1D4A6;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo stretchy='true'>&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> becomes an input, now denoted by <inline-formula id="ieqn-2847"><mml:math id="mml-ieqn-2847"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>. Electrical circuits can be designed to approximate the dynamical behavior of a spatially-discrete, <italic>temporally-continuous</italic> recurrent neural network (RNN) described by Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>), which is Eq. (<xref ref-type="disp-formula" rid="eqn-1">1</xref>) in [<xref ref-type="bibr" rid="ref-32">32</xref>]:<xref ref-type="fn" rid="fn318"><sup>318</sup></xref><fn id="fn318"><label>318</label><p>In the original notation, Eq. 
(<xref ref-type="disp-formula" rid="eqn-510">510</xref>) was written as <inline-formula id="ieqn-3353"><mml:math id="mml-ieqn-3353"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo></mml:msub></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-32">32</xref>], whose outputs <inline-formula id="ieqn-3354"><mml:math id="mml-ieqn-3354"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> in the previous expression are now rewritten as <inline-formula id="ieqn-3355"><mml:math id="mml-ieqn-3355"><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> in Eq. 
(<xref ref-type="disp-formula" rid="eqn-510">510</xref>), and the biases <inline-formula id="ieqn-3356"><mml:math id="mml-ieqn-3356"><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, playing the role of inputs, are rewritten as <inline-formula id="ieqn-3357"><mml:math id="mml-ieqn-3357"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to be consistent with the notation for inputs and outputs used throughout in the present work; see Section <xref ref-type="sec" rid="s4_2">4.2</xref> on matrix notation, Eq. (<xref ref-type="disp-formula" rid="eqn-7">7</xref>), and Section <xref ref-type="sec" rid="s7_1">7.1</xref> on <italic>discrete</italic> recurrent neural networks, which are discrete in both space and time. The paper [<xref ref-type="bibr" rid="ref-32">32</xref>] was cited in both [<xref ref-type="bibr" rid="ref-19">19</xref>] and [<xref ref-type="bibr" rid="ref-36">36</xref>], with the latter leading us to it.</p></fn></p>
<p><disp-formula id="eqn-510"><label>(510)</label><mml:math id="mml-eqn-510" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The network is called symmetric if the weight matrix is symmetric, i.e.,</p>
<p><disp-formula id="eqn-511"><label>(511)</label><mml:math id="mml-eqn-511" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>The difference between Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-507">507</xref>) is that Eq. (<xref ref-type="disp-formula" rid="eqn-507">507</xref>) is based on the expression for <inline-formula id="ieqn-2848"><mml:math id="mml-ieqn-2848"><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-500">500</xref>), and thus has no feedback loop.</p>
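<p>The feedback in Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>) can be made concrete with a small numerical sketch. The Python fragment below integrates Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>) by forward Euler under a constant input and checks the symmetry condition of Eq. (<xref ref-type="disp-formula" rid="eqn-511">511</xref>). The function names, the step size, and the choice of a constant input are assumptions made here for illustration, not the procedure of [<xref ref-type="bibr" rid="ref-32">32</xref>].</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def rectify(z):
    """Rectification [z]_+ = max(z, 0) appearing in Eq. (510)."""
    return z if z > 0.0 else 0.0


def rnn_rhs(y, x, w, tau):
    """dy_i/dt = (-y_i + [x_i + sum_j w_ij y_j]_+) / tau_i."""
    n = len(y)
    return [
        (-y[i] + rectify(x[i] + sum(w[i][j] * y[j] for j in range(n)))) / tau[i]
        for i in range(n)
    ]


def integrate_rnn(y0, x, w, tau, dt, n_steps):
    """Forward-Euler integration with constant input x; returns the final state."""
    y = list(y0)
    for _ in range(n_steps):
        dy = rnn_rhs(y, x, w, tau)
        y = [yi + dt * dyi for yi, dyi in zip(y, dy)]
    return y


def is_symmetric(w):
    """Check the symmetry condition w_ij = w_ji of Eq. (511)."""
    n = len(w)
    return all(w[i][j] == w[j][i] for i in range(n) for j in range(n))
```
]]></preformat>
<p>With zero weights the feedback disappears and each output relaxes to the rectified input. With nonzero weights the steady state must satisfy the coupled fixed-point relation obtained by setting the time derivative in Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>) to zero.</p>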
<p>A time-dependent time delay <inline-formula id="ieqn-2849"><mml:math id="mml-ieqn-2849"><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> can be introduced into Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>) leading to spatially-discrete, <italic>temporally-continuous</italic> RNNs with time delay, which is Eq. (<xref ref-type="disp-formula" rid="eqn-1">1</xref>) in [<xref ref-type="bibr" rid="ref-323">323</xref>]:<xref ref-type="fn" rid="fn319"><sup>319</sup></xref><fn id="fn319"><label>319</label><p>In original notation, Eq. (<xref ref-type="disp-formula" rid="eqn-512">512</xref>) was written as <inline-formula id="ieqn-3358"><mml:math id="mml-ieqn-3358"><mml:mover><mml:mi>z</mml:mi><mml:mo>&#x2022;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>A</mml:mi><mml:mi>z</mml:mi><mml:mo>+</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>h</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>J</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in [<xref ref-type="bibr" rid="ref-323">323</xref>], where <inline-formula id="ieqn-3359"><mml:math id="mml-ieqn-3359"><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-3360"><mml:math id="mml-ieqn-3360"><mml:mi>A</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">T</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-3361"><mml:math id="mml-ieqn-3361"><mml:mi>h</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo 
stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-3362"><mml:math id="mml-ieqn-3362"><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula>.</p></fn></p>
<p><disp-formula id="eqn-512"><label>(512)</label><mml:math id="mml-eqn-512" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">T</mml:mi><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>+</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the diagonal matrix <inline-formula id="ieqn-2850"><mml:math id="mml-ieqn-2850"><mml:mrow><mml:mi mathvariant="bold-italic">T</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Diag</mml:mtext></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> contains the synaptic time constants <inline-formula id="ieqn-2851"><mml:math id="mml-ieqn-2851"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> as its diagonal coefficients, the matrix <inline-formula id="ieqn-2852"><mml:math id="mml-ieqn-2852"><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> contains the outputs, the bias matrix <inline-formula id="ieqn-2853"><mml:math id="mml-ieqn-2853"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> plays the role 
of input matrix (thus denoted by <inline-formula id="ieqn-2854"><mml:math id="mml-ieqn-2854"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:math></inline-formula> instead of <inline-formula id="ieqn-2855"><mml:math id="mml-ieqn-2855"><mml:mrow><mml:mi mathvariant="bold-italic">b</mml:mi></mml:mrow></mml:math></inline-formula>), <inline-formula id="ieqn-2856"><mml:math id="mml-ieqn-2856"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the activation function, <inline-formula id="ieqn-2857"><mml:math id="mml-ieqn-2857"><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> the weight matrix, and <inline-formula id="ieqn-2858"><mml:math id="mml-ieqn-2858"><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the time-dependent delay; see Figure <xref ref-type="fig" rid="fig-135">135</xref>.</p>
<p>For discrete RNNs, the delay is a constant integer set to one, i.e.,</p>
<p><disp-formula id="eqn-513"><label>(513)</label><mml:math id="mml-eqn-513" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi><mml:mrow><mml:mtext>&#xA0;(integer)</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>as expressed in Eq. (<xref ref-type="disp-formula" rid="eqn-275">275</xref>) in Section <xref ref-type="sec" rid="s7_1">7.1</xref>.</p>
<p>Both Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-512">512</xref>) can be rewritten in the following form:</p>
<p><disp-formula id="eqn-514"><label>(514)</label><mml:math id="mml-eqn-514" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="bold-italic">T</mml:mi><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
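<p>To illustrate how the delayed feedback in Eq. (<xref ref-type="disp-formula" rid="eqn-512">512</xref>) enters a computation, the Python sketch below integrates the equation by forward Euler while storing the past outputs in a history list. Several choices are assumptions made here for illustration and are not prescribed by [<xref ref-type="bibr" rid="ref-323">323</xref>]: the hyperbolic tangent as activation function, a nonnegative delay rounded to a whole number of time steps, a constant history before the initial time, and a constant input.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def simulate_delayed_rnn(x, w, tau, delay_fn, dt, n_steps, y_init):
    """Forward-Euler integration of T dy/dt = -y + a[x + W y(t - d(t))].

    Assumptions of this sketch: a = tanh, constant input x, delay_fn(t)
    returns a nonnegative delay, and y(t) = y_init for all t <= 0.
    Returns the list of states, where history[k] approximates y(k * dt).
    """
    n = len(y_init)
    history = [list(y_init)]
    for k in range(n_steps):
        t = k * dt
        y = history[k]
        # Delay expressed in whole time steps, clamped to be nonnegative.
        lag = max(int(round(delay_fn(t) / dt)), 0)
        # Delayed output y(t - d(t)); fall back on the constant pre-history.
        y_del = history[max(k - lag, 0)]
        y_new = []
        for i in range(n):
            z = x[i] + sum(w[i][j] * y_del[j] for j in range(n))
            y_new.append(y[i] + dt / tau[i] * (-y[i] + math.tanh(z)))
        history.append(y_new)
    return history
```
]]></preformat>
<p>Setting the delay function to zero removes the lag, so the same routine also covers the undelayed feedback form, with the activation chosen above in place of the rectification.</p>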
<p>Densely distributed pre-synaptic input points [see Eq. (<xref ref-type="disp-formula" rid="eqn-500">500</xref>) and Figure <xref ref-type="fig" rid="fig-131">131</xref> of a biological neuron] can be approximated by a continuous distribution in space, represented by <inline-formula id="ieqn-2859"><mml:math id="mml-ieqn-2859"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-2860"><mml:math id="mml-ieqn-2860"><mml:mi>s</mml:mi></mml:math></inline-formula> being the space variable, and <inline-formula id="ieqn-2861"><mml:math id="mml-ieqn-2861"><mml:mi>t</mml:mi></mml:math></inline-formula> the time variable. In this case, an RNN that is <italic>continuous</italic> in both space and time, called a &#x201C;continuously labeled RNN&#x201D;, can be written as follows:<xref ref-type="fn" rid="fn320"><sup>320</sup></xref><fn id="fn320"><label>320</label><p>[<xref ref-type="bibr" rid="ref-19">19</xref>], p. 240, Eq. (7.14).</p></fn></p>
<p><disp-formula id="eqn-515"><label>(515)</label><mml:math id="mml-eqn-515" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-516"><label>(516)</label><mml:math id="mml-eqn-516" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>M</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mi>d</mml:mi><mml:mi>&#x03B3;</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-2862"><mml:math id="mml-ieqn-2862"><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> is the neuron density, assumed to be constant. Space-time continuous RNNs such as Eqs. (<xref ref-type="disp-formula" rid="eqn-515">515</xref>)-(<xref ref-type="disp-formula" rid="eqn-516">516</xref>) have been used to model, e.g., the visually responsive neurons in the premotor cortex [<xref ref-type="bibr" rid="ref-19">19</xref>], p. 242.</p>
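<p>One way to see how Eqs. (<xref ref-type="disp-formula" rid="eqn-515">515</xref>)-(<xref ref-type="disp-formula" rid="eqn-516">516</xref>) relate to their spatially-discrete counterparts is to replace the integral by a quadrature sum on a uniform grid. The Python sketch below does so with a midpoint rule, followed by one forward-Euler step in time. The finite integration interval, the grid, the kernels passed as functions, and all function names are assumptions made here for illustration and do not come from [<xref ref-type="bibr" rid="ref-19">19</xref>].</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def synaptic_input(s_grid, ds, rho, w_kernel, m_kernel, x_vals, y_vals):
    """Midpoint-rule approximation of Eq. (516) at every grid point.

    s_grid holds the cell midpoints, ds the (uniform) cell width, and
    x_vals, y_vals the input and output sampled at those midpoints.
    """
    z_vals = []
    for s in s_grid:
        total = 0.0
        for g, xg, yg in zip(s_grid, x_vals, y_vals):
            total += (w_kernel(s, g) * xg + m_kernel(s, g) * yg) * ds
        z_vals.append(rho * total)
    return z_vals


def field_euler_step(y_vals, z_vals, tau_r, dt, activation):
    """One forward-Euler step of Eq. (515) at every grid point."""
    return [y + dt / tau_r * (-y + activation(z)) for y, z in zip(y_vals, z_vals)]
```
]]></preformat>
<p>After this discretization, the kernel values sampled on the grid, scaled by the neuron density and the cell width, play the role of the entries of a weight matrix in a spatially-discrete network.</p>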
<fig id="fig-135">
<label>Figure 135</label>
<caption><title><italic>Continuous recurrent neural network with time-dependent delay</italic> <inline-formula id="ieqn-2037"><mml:math id="mml-ieqn-2037"><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (green feedback loop, Section <xref ref-type="sec" rid="s13_2_2">13.2.2</xref>), as expressed in Eq. (<xref ref-type="disp-formula" rid="eqn-514">514</xref>), where <inline-formula id="ieqn-2038"><mml:math id="mml-ieqn-2038"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the operator with the first derivative term plus a standard static term&#x2014;which is an activation function acting on a linear combination of input and bias, i.e., <inline-formula id="ieqn-2039"><mml:math id="mml-ieqn-2039"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as in Eq. (<xref ref-type="disp-formula" rid="eqn-35">35</xref>) and Eq. 
(<xref ref-type="disp-formula" rid="eqn-32">32</xref>)&#x2014;<inline-formula id="ieqn-2040"><mml:math id="mml-ieqn-2040"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the input, <inline-formula id="ieqn-2041"><mml:math id="mml-ieqn-2041"><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the output with the red feedback loop, and <inline-formula id="ieqn-2042"><mml:math id="mml-ieqn-2042"><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> the delayed output with the green feedback loop. This figure is the more general continuous counterpart of the discrete RNN in Figure <xref ref-type="fig" rid="fig-79">79</xref>, represented by Eq. (<xref ref-type="disp-formula" rid="eqn-275">275</xref>), which is a particular case of Eq. (<xref ref-type="disp-formula" rid="eqn-514">514</xref>). We also refer readers to Remark <xref ref-type="statement" rid="st7_1">7.1</xref> and the notation equivalence <inline-formula id="ieqn-2043"><mml:math id="mml-ieqn-2043"><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2261;</mml:mo><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as noted in Eq. (<xref ref-type="disp-formula" rid="eqn-276">276</xref>).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-135.tif"/>
</fig>
<fig id="fig-136">
<label>Figure 136</label>
<caption><title><italic>Crayfish</italic> (Section <xref ref-type="sec" rid="s13_3_2">13.3.2</xref>), freshwater crustaceans. Anatomy.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-136.tif"/>
</fig>
</sec> </sec>
<sec id="s13_3"><label>13.3</label>
<title>Activation functions</title>
<sec id="s13_3_1"><label>13.3.1</label>
<title>Logistic sigmoid</title>
<p>The use of the logistic sigmoid function (Figure <xref ref-type="fig" rid="fig-30">30</xref>) in neuroscience dates back to the seminal work of Nobel Laureates Hodgkin &amp; Huxley (1952) [<xref ref-type="bibr" rid="ref-322">322</xref>] in the form of an electrical circuit (Figure <xref ref-type="fig" rid="fig-134">134</xref>), and to the work reported in [<xref ref-type="bibr" rid="ref-35">35</xref>] in a form closer to today&#x2019;s networks; see also [<xref ref-type="bibr" rid="ref-325">325</xref>] and [<xref ref-type="bibr" rid="ref-37">37</xref>].</p>
<p>The authors of [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 219, remarked: &#x201C;Despite the early popularity of rectification (see next Section <xref ref-type="sec" rid="s13_3_2">13.3.2</xref>), it was largely replaced by sigmoids in the 1980s, perhaps because sigmoids perform better when neural networks are very small.&#x201D;</p>
<p>The rectified linear function has, however, made a comeback and was a key component responsible for the success of deep learning. It also helped inspire a variant that in 2015 surpassed human-level performance in image classification, as &#x201C;it expedites convergence of the training procedure [<xref ref-type="bibr" rid="ref-16">16</xref>] and leads to better solutions [21, 8, 20, 34] than conventional sigmoid-like units&#x201D; [<xref ref-type="bibr" rid="ref-61">61</xref>]. See Section <xref ref-type="sec" rid="s5_3_3">5.3.3</xref> on Parametric Rectified Linear Unit.</p>

<fig id="fig-137">
<label>Figure 137</label>
<caption><title><italic>Crayfish giant motor synapse</italic> (Section <xref ref-type="sec" rid="s13_3_2">13.3.2</xref>). The (pre-synaptic) lateral giant fiber was connected to the (post-synaptic) giant motor fiber through a synapse where the two fibers cross each other at the location annotated by &#x201C;Giant motor synapse&#x201D; in the figure. This synapse was right underneath the giant motor fiber, at the crossing and contact point, and thus could not be seen. The two left electrodes (including the second electrode from left) were inserted in the lateral giant fiber, with the two right electrodes in the giant motor fiber. Currents were injected into the two electrodes indicated by solid red arrows, and electrical outputs recorded from the two electrodes indicated by dashed blue arrows [<xref ref-type="bibr" rid="ref-326">326</xref>]. (Figure reproduced with permission of the publisher Wiley.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-137.tif"/>
</fig>
</sec>
<sec id="s13_3_2"><label>13.3.2</label>
<title>Rectified linear unit (ReLU)</title>
<p>Yoshua Bengio, the senior author of [<xref ref-type="bibr" rid="ref-78">78</xref>] and a Turing Award recipient, recounted the rise in popularity of ReLU in deep learning networks in an interview in [<xref ref-type="bibr" rid="ref-79">79</xref>]:</p>
<disp-quote><p>&#x201C;The big question was how we could train deeper networks... Then a few years later, we discovered that we didn&#x2019;t need these approaches [Restricted Boltzmann Machines, autoencoders] to train deep networks, we could just change the nonlinearity. One of my students was working with neuroscientists, and we thought that we should try rectified linear units (ReLUs)&#x2014;we called them rectifiers in those days&#x2014;because they were more biologically plausible, and this is an example of actually taking inspiration from the brain. We had previously used a sigmoid function to train neural nets, but it turned out that by using ReLUs we could suddenly train very deep nets much more easily. That was another big change that occurred around 2010 or 2011.&#x201D;</p>
</disp-quote><p>The student mentioned by Bengio was likely the first author of [<xref ref-type="bibr" rid="ref-113">113</xref>]; see also the earlier Section <xref ref-type="sec" rid="s4_4_2">4.4.2</xref> on activation functions.</p>
<p>We became aware of Ref. [<xref ref-type="bibr" rid="ref-32">32</xref>], which appeared in 2000 and in which a spatially-discrete, temporally-continuous recurrent neural network was used with a rectified linear function, as expressed in Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>), through Ref. [<xref ref-type="bibr" rid="ref-36">36</xref>]. On the other hand, prior to its introduction in deep neural networks, the rectified linear unit had been used in neuroscience since at least 1995, although the source [<xref ref-type="bibr" rid="ref-327">327</xref>] cited in [<xref ref-type="bibr" rid="ref-113">113</xref>] was a book. Research results published in papers would appear in book form several years later:</p>
<disp-quote><p>&#x201C;The current and third wave, deep learning, started around 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a) and is just now appearing in book form as of 2016. The other two waves [cybernetics and connectionism] similarly appeared in book form much later than the corresponding scientific activity occurred&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 13.</p>
<fig id="fig-138">
<label>Figure 138</label>
<caption><title><italic>Crayfish Giant Motor Synapse</italic> (Section <xref ref-type="sec" rid="s13_3_2">13.3.2</xref>). The response in SubFigure (a) is similar to that of a rectifier circuit with leaky diode in Figure <xref ref-type="fig" rid="fig-25">25</xref> and Figure <xref ref-type="fig" rid="fig-29">29</xref> [red curve (<inline-formula id="ieqn-2044"><mml:math id="mml-ieqn-2044"><mml:mo>&#x2212;</mml:mo><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mi>D</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mi>R</mml:mi></mml:msub></mml:math></inline-formula>) in SubFigure (b)] [<xref ref-type="bibr" rid="ref-326">326</xref>]. (Figure reproduced with permission of the publisher Wiley.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-138.tif"/>
</fig>   
</disp-quote><p>Another clue that the rectified linear function was a well-known, well-accepted concept&#x2014;similar to the relation <inline-formula id="ieqn-2863"><mml:math id="mml-ieqn-2863"><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">F</mml:mi></mml:mrow></mml:math></inline-formula> in the finite element method&#x2014;is that the authors of [<xref ref-type="bibr" rid="ref-32">32</xref>] did not provide any reference for their own important Eq. (<xref ref-type="disp-formula" rid="eqn-1">1</xref>), which is reproduced in Eq. (<xref ref-type="disp-formula" rid="eqn-510">510</xref>), as if it were already obvious to anyone in neuroscience.</p>
<p>Indeed, more than sixty years ago, in a series of papers [<xref ref-type="bibr" rid="ref-58">58</xref>] [<xref ref-type="bibr" rid="ref-326">326</xref>] [<xref ref-type="bibr" rid="ref-59">59</xref>], Furshpan &amp; Potter established that current flows through a crayfish neuron synapse (Figure <xref ref-type="fig" rid="fig-136">136</xref> and Figure <xref ref-type="fig" rid="fig-137">137</xref>) in essentially one direction, thus deducing that the synapse can be modeled as a rectifier, i.e., a diode in series with a resistance, as shown in Figure <xref ref-type="fig" rid="fig-138">138</xref>.</p>

</sec>
<sec id="s13_3_3"><label>13.3.3</label>
<title>New activation functions</title>
<p>&#x201C;The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles,&#x201D; [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 186. Indeed, the Swish activation function in Figure <xref ref-type="fig" rid="fig-139">139</xref> of the form</p>
<p><disp-formula id="eqn-517"><label>(517)</label><mml:math id="mml-eqn-517" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>with <inline-formula id="ieqn-2864"><mml:math id="mml-ieqn-2864"><mml:mrow><mml:mi>&#x1D530;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> being the logistic sigmoid given in Figure <xref ref-type="fig" rid="fig-30">30</xref> and in Eq. (<xref ref-type="disp-formula" rid="eqn-113">113</xref>), was found in [<xref ref-type="bibr" rid="ref-36">36</xref>] to outperform the rectified linear unit (ReLU) in a number of benchmark tests.</p>
<p>On the other hand, it would be hard to beat the efficiency of the rectified linear function, both in evaluating the activation for the weighted combination of inputs <inline-formula id="ieqn-2865"><mml:math id="mml-ieqn-2865"><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">z</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> of layer <inline-formula id="ieqn-2866"><mml:math id="mml-ieqn-2866"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and in computing the gradient, since the first derivative of ReLU is the Heaviside function; see Figure <xref ref-type="fig" rid="fig-24">24</xref>.</p>
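<p>As a minimal sketch of the trade-off just described, the Swish function of Eq. (<xref ref-type="disp-formula" rid="eqn-517">517</xref>) and the rectified linear function can be written side by side in a few lines of Python. The function names (<monospace>sigmoid</monospace>, <monospace>swish</monospace>, <monospace>swish_grad</monospace>, <monospace>relu</monospace>, <monospace>relu_grad</monospace>) are our own illustrative choices, and the value 0 assigned to the ReLU derivative at the origin is a convention assumed here.</p>
<code language="python" xml:space="preserve">
```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), split in two branches to avoid overflow."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Swish activation a(x) = x * s(beta * x), with s the logistic sigmoid."""
    return x * sigmoid(beta * x)


def swish_grad(x, beta=1.0):
    """First derivative of Swish: s + beta * x * s * (1 - s), with s = s(beta * x)."""
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)


def relu(x):
    """Rectified linear function max(x, 0): a single comparison, no exponential."""
    return max(x, 0.0)


def relu_grad(x):
    """Heaviside step function; the value at x = 0 is set to 0 (assumed convention)."""
    return 1.0 if x > 0.0 else 0.0
```
</code>
<p>In this sketch, each evaluation of <monospace>swish</monospace> or <monospace>swish_grad</monospace> costs one exponential, whereas <monospace>relu</monospace> and <monospace>relu_grad</monospace> cost one comparison each, which illustrates the efficiency argument made above.</p>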

<fig id="fig-139">
<label>Figure 139</label>
<caption><title><italic>Swish function</italic> (Section <xref ref-type="sec" rid="s13_3_3">13.3.3</xref>) <inline-formula id="ieqn-2045"><mml:math id="mml-ieqn-2045"><mml:mi>x</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mi mathvariant="fraktur">s</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, with <inline-formula id="ieqn-2046"><mml:math id="mml-ieqn-2046"><mml:mrow><mml:mi>&#x1D530;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> being the logistic sigmoid in Figure <xref ref-type="fig" rid="fig-30">30</xref>, and other activation functions [<xref ref-type="bibr" rid="ref-36">36</xref>]. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-139.tif"/>
</fig>
<p>A zoo of activation functions is provided in &#x201C;Activation function&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Activation_function&amp;oldid=870454307">version 22:46, 24 November 2018</ext-link> and the more recent <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Activation_function&amp;oldid=1099333527">version 06:30, 20 July 2022</ext-link>, in which several activation functions had been removed, e.g., the &#x201C;Square Nonlinearity (SQNL)&#x201D;<xref ref-type="fn" rid="fn321"><sup>321</sup></xref><fn id="fn321"><label>321</label><p>The &#x201C;Square Nonlinearity (SQNL)&#x201D; activation, having a shape similar to that of the hyperbolic tangent function, appeared in the article &#x201C;Activation function&#x201D; for the last time in <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Activation_function&amp;oldid=1012711871">version 22:00, 17 March 2021</ext-link>, and was removed from the table of activation functions starting from <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Activation_function&amp;oldid=1017248224">version 18:13, 11 April 2021</ext-link> with the comment &#x201C;Remove SQLU since it has 0 citations; it needs to be broadly adopted to be in this list; Remove SQNL (also from the same author, and this also does not have broad adoption)&#x201D;; see the article <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Activation_function&amp;offset=&amp;limit=500&amp;action=history">History</ext-link>.</p></fn> listed in the 2018 version of this zoo.</p></sec></sec>
<sec id="s13_4"><label>13.4</label>
<title>Back-propagation, automatic differentiation</title>
<p>&#x201C;At its core, backpropagation [Section <xref ref-type="sec" rid="s5">5</xref>] is simply an efficient and exact method for calculating all the derivatives of a single target quantity (such as pattern classification error) with respect to a large set of input quantities (such as the parameters or weights in a classification rule)&#x201D; [<xref ref-type="bibr" rid="ref-328">328</xref>].</p>
<p>In a survey on automatic differentiation in [<xref ref-type="bibr" rid="ref-329">329</xref>], it was stated that: &#x201C;in simplest terms, backpropagation models learning as gradient descent in neural network weight space, looking for the minima of an objective function.&#x201D;</p>
<p>Such a statement identifies back-propagation with optimization by gradient descent. But according to the authors of [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 198, &#x201C;back-propagation is often misunderstood as meaning the whole learning algorithm for multilayer neural networks&#x201D;. They clearly distinguished back-propagation <italic>only</italic> as a method to compute the gradient of the cost function with respect to the parameters, while another algorithm, such as stochastic gradient descent, is used to perform the learning using this gradient. Here, &#x201C;learning&#x201D; means network training, i.e., finding the parameters that minimize the cost function, which &#x201C;typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.&#x201D;<xref ref-type="fn" rid="fn322"><sup>322</sup></xref><fn id="fn322"><label>322</label><p>See [<xref ref-type="bibr" rid="ref-78">78</xref>], Chap. 8, &#x201C;Optimization for training deep models&#x201D;, p. 267.</p></fn></p>
<p>According to [<xref ref-type="bibr" rid="ref-329">329</xref>], automatic differentiation, or in short &#x201C;autodiff&#x201D;, is &#x201C;a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs.&#x201D;</p>
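<p>The distinction drawn above between computing the gradient and using it can be sketched with a toy network. The example below is only an illustration under our own assumptions: a network with one hidden unit, <monospace>y = w2 * tanh(w1 * x)</monospace>, a squared-error cost, plain (non-stochastic) gradient descent, and hypothetical function names; it is not taken from [<xref ref-type="bibr" rid="ref-78">78</xref>] or [<xref ref-type="bibr" rid="ref-329">329</xref>]. The function <monospace>backprop</monospace> only applies the chain rule in reverse order to obtain derivatives, while <monospace>gradient_descent_step</monospace> is the separate algorithm that performs the learning.</p>
<code language="python" xml:space="preserve">
```python
import math


def forward(w1, w2, x):
    """Forward pass of the toy network y = w2 * tanh(w1 * x); keeps intermediates."""
    z = w1 * x
    h = math.tanh(z)
    y = w2 * h
    return z, h, y


def cost(w1, w2, x, t):
    """Squared-error cost J = 0.5 * (y - t)**2 for one training pair (x, t)."""
    return 0.5 * (forward(w1, w2, x)[2] - t) ** 2


def backprop(w1, w2, x, t):
    """Gradient of J with respect to (w1, w2) by the chain rule, output to input.

    This function only computes derivatives; it never changes the weights.
    """
    z, h, y = forward(w1, w2, x)
    dJ_dy = y - t                  # J = 0.5 * (y - t)**2
    dJ_dw2 = dJ_dy * h             # y = w2 * h
    dJ_dh = dJ_dy * w2
    dJ_dz = dJ_dh * (1.0 - h * h)  # derivative of tanh(z) is 1 - tanh(z)**2
    dJ_dw1 = dJ_dz * x             # z = w1 * x
    return dJ_dw1, dJ_dw2


def gradient_descent_step(w1, w2, grads, lr):
    """The learning algorithm proper: moves the weights against the gradient."""
    g1, g2 = grads
    return w1 - lr * g1, w2 - lr * g2


def train(w1, w2, x, t, lr=0.1, steps=200):
    """Training loop: alternates gradient computation and weight update."""
    for _ in range(steps):
        w1, w2 = gradient_descent_step(w1, w2, backprop(w1, w2, x, t), lr)
    return w1, w2
```
</code>
<p>Replacing <monospace>gradient_descent_step</monospace> by another optimizer would leave <monospace>backprop</monospace> untouched, which is the separation of roles emphasized in [<xref ref-type="bibr" rid="ref-78">78</xref>].</p>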
<sec id="s13_4_1"><label>13.4.1</label>
<title>Back-propagation</title>
<p>In an interview published in 2018 [<xref ref-type="bibr" rid="ref-79">79</xref>], Hinton confirmed that backpropagation was independently invented by many people before his own 1986 paper [<xref ref-type="bibr" rid="ref-22">22</xref>]. Here we focus on information that is not found in the review of backpropagation in [<xref ref-type="bibr" rid="ref-12">12</xref>].</p>
<p>For example, the success reported in [<xref ref-type="bibr" rid="ref-22">22</xref>] lay not in backpropagation itself, but in its use in psychology:</p>
<disp-quote><p>&#x201C;Back in the mid-1980s, when computers were very slow, I used a simple example where you would have a family tree, and I would tell you about relationships within that family tree. I would tell you things like Charlotte&#x2019;s mother is Victoria, so I would say Charlotte and mother, and the correct answer is Victoria. I would also say Charlotte and father, and the correct answer is James. Once I&#x2019;ve said those two things, because it&#x2019;s a very regular family tree with no divorces, you could use conventional AI to infer using your knowledge of family relations that Victoria must be the spouse of James because Victoria is Charlotte&#x2019;s mother and James is Charlotte&#x2019;s father. The neural net could infer that too, but it didn&#x2019;t do it by using rules of inference, it did it by learning a bunch of features for each person. Victoria and Charlotte would both be a bunch of separate features, and then by using interactions between those vectors of features, that would cause the output to be the features for the correct person. From the features for Charlotte and from the features for mother, it could derive the features for Victoria, and when you trained it, it would learn to do that. The most exciting thing was that for these different words, it would learn these feature vectors, and it was learning distributed representations of words.&#x201D; [<xref ref-type="bibr" rid="ref-79">79</xref>]</p>
</disp-quote><p>For psychologists, &#x201C;a learning algorithm that could learn representations of things was a big breakthrough,&#x201D; and Hinton&#x2019;s contribution in [<xref ref-type="bibr" rid="ref-22">22</xref>] was to show that &#x201C;backpropagation would learn these distributed representations, and that was what was interesting to psychologists, and eventually, to AI people.&#x201D; But backpropagation lost ground to other technologies in machine learning:</p>
<disp-quote><p>&#x201C;In the early 1990s, ... the support vector machine did better at recognizing handwritten digits than backpropagation, and handwritten digits had been a classic example of backpropagation doing something really well. Because of that, the machine learning community really lost interest in backpropagation&#x201D; [<xref ref-type="bibr" rid="ref-79">79</xref>].<xref ref-type="fn" rid="fn323"><sup>323</sup></xref><fn id="fn323"><label>323</label><p>See Footnote <xref ref-type="fn" rid="fn31">31</xref> on how research on kernel methods (Section <xref ref-type="sec" rid="s8">8</xref>) for Support Vector Machines has recently been used in connection with networks with infinite width to understand how deep learning works (Section <xref ref-type="sec" rid="s14_2">14.2</xref>).</p></fn></p>
</disp-quote><p>Despite this setback, psychologists still considered backpropagation an interesting approach, and continued to work with this method:</p>
<disp-quote><p>There is &#x201C;a distinction between AI and machine learning on the one hand, and psychology on the other hand. Once backpropagation became popular in 1986, a lot of psychologists got interested in it, and they didn&#x2019;t really lose their interest in it, they kept believing that it was an interesting algorithm, maybe not what the brain did, but an interesting way of developing representations&#x201D; [<xref ref-type="bibr" rid="ref-79">79</xref>].</p>
</disp-quote><p>The 2015 review paper [<xref ref-type="bibr" rid="ref-12">12</xref>] referred to Werbos&#x2019; 1974 PhD dissertation for a preliminary discussion of backpropagation (BP),</p>
<disp-quote><p>&#x201C;Efficient BP was soon explicitly used to minimize cost functions by adapting control parameters (weights) (Dreyfus, 1973). Compare some preliminary, NN-specific discussion (Werbos, 1974, Section 5.5.1), a method for multilayer threshold NNs (Bobrowski, 1978), and a computer program for automatically deriving and implementing BP for given differentiable systems (Speelpenning, 1980).&#x201D;</p>
</disp-quote><p>and explicitly attributed to Werbos early applications of backpropagation in neural networks (NN):</p>
<disp-quote><p>&#x201C;To my knowledge, the first NN-specific application of efficient backpropagation was described in 1981 (Werbos, 1981, 2006). Related work was published several years later (LeCun, 1985, 1988; Parker, 1985). A paper of 1986 significantly contributed to the popularization of BP for NNs (Rumelhart, Hinton, &amp; Williams, 1986), experimentally demonstrating the emergence of useful internal representations in hidden layers.&#x201D;</p>
</disp-quote><p>See also [<xref ref-type="bibr" rid="ref-112">112</xref>] [<xref ref-type="bibr" rid="ref-328">328</xref>] [<xref ref-type="bibr" rid="ref-330">330</xref>]. The 1986 paper mentioned above was [<xref ref-type="bibr" rid="ref-22">22</xref>].</p>
</sec>
<sec id="s13_4_2"><label>13.4.2</label>
<title>Automatic differentiation</title>
<p>The authors of [<xref ref-type="bibr" rid="ref-78">78</xref>], p. 214, wrote of backprop as a particular case of automatic differentiation (AD):</p>
<disp-quote><p>&#x201C;The deep learning community has been somewhat isolated from the broader computer science community and has largely developed its own cultural attitudes concerning how to perform differentiation. More generally, the field of <bold>automatic differentiation</bold> is concerned with how to compute derivatives algorithmically. The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called <bold>reverse mode accumulation</bold>.&#x201D;</p>
</disp-quote><p>Let&#x2019;s decode what was said above. The deep learning community was isolated because it was not in the mainstream of computer science research during the last AI winter, as Hinton described in an interview published in 2018 [<xref ref-type="bibr" rid="ref-79">79</xref>]:</p>
<disp-quote><p>&#x201C;This was at a time when all of us would have been a bit isolated in a fairly hostile environment&#x2014;the environment for deep learning was fairly hostile until quite recently&#x2014;it was very helpful to have this funding that allowed us to spend quite a lot of time with each other in small meetings, where we could really share unpublished ideas.&#x201D;</p>
</disp-quote><p>Had Hinton not moved from Carnegie Mellon University in the US to the University of Toronto in Canada, he would have had to change his research topic to get funding, the AI winter would have lasted longer, and he might not have received the Turing Award along with LeCun and Bengio [<xref ref-type="bibr" rid="ref-331">331</xref>]:</p>
<disp-quote><p>&#x201C;The Turing Award, which was introduced in 1966, is often called the Nobel Prize of computing, and it includes a &#x0024;1 million prize, which the three scientists will share.&#x201D;</p>
</disp-quote><p>A recent review of AD is given in [<xref ref-type="bibr" rid="ref-329">329</xref>], where backprop was described as a particular case of AD, known as &#x201C;reverse mode AD&#x201D;; see also [<xref ref-type="bibr" rid="ref-12">12</xref>].<xref ref-type="fn" rid="fn324"><sup>324</sup></xref><fn id="fn324"><label>324</label><p>We only want to point out the connection between backprop and AD, together with a recent review paper on AD, but will not review AD itself here.</p></fn></p>
</sec> </sec>
<sec id="s13_5"><label>13.5</label>
<title>Resurgence of AI and current state</title>
<p>The success of deep neural networks in the ImageNet competitions since 2012, particularly when they surpassed human-level performance in 2015 (see Figure <xref ref-type="fig" rid="fig-3">3</xref> and Section <xref ref-type="sec" rid="s5_3_3">5.3.3</xref> on Parametric ReLU), was preceded by their success in speech recognition, as recounted by Hinton in a 2018 interview [<xref ref-type="bibr" rid="ref-79">79</xref>]:</p>

<disp-quote><p>&#x201C;For computer vision, 2012 was the inflection point. For speech, the inflection point was a few years earlier. Two different graduate students at Toronto showed in 2009 that you could make a better speech recognizer using deep learning. They went as interns to IBM and Microsoft, and a third student took their system to Google. The basic system that they had built was developed further, and over the next few years, all these companies&#x2019; labs converted to doing speech recognition using neural nets. Many of the best people in speech recognition had switched to believing in neural networks before 2012, but the big public impact was in 2012, when the vision community, almost overnight, got turned on its head and this crazy approach turned out to win.&#x201D;</p>
</disp-quote><p>The 2009 breakthrough of applying deep learning to speech recognition mentioned above did not receive as much attention in the non-technical press as the 2012 breakthrough in computer vision (e.g., [<xref ref-type="bibr" rid="ref-75">75</xref>] [<xref ref-type="bibr" rid="ref-74">74</xref>]), and was thus not popularly known, except inside the deep-learning community.</p>
<p>Deep learning is being developed and used to guide consumers in nutrition [<xref ref-type="bibr" rid="ref-332">332</xref>]:</p>
<disp-quote><p>&#x201C;Using machine learning, a subtype of artificial intelligence, the billions of data points were analyzed to see what drove the glucose response to specific foods for each individual. In that way, an algorithm was built without the biases of the scientists.</p>
<p>There are other efforts underway in the field as well. In some continuing nutrition studies, smartphone photos of participants&#x2019; plates of food are being processed by deep learning, another subtype of A.I., to accurately determine what they are eating. This avoids the hassle of manually logging in the data and the use of unreliable food diaries (as long as participants remember to take the picture).</p>
<p>But that is a single type of data. What we really need to do is pull in multiple types of data&#x2014;activity, sleep, level of stress, medications, genome, microbiome and glucose&#x2014;from multiple devices, like skin patches and smartwatches. With advanced algorithms, this is eminently doable. In the next few years, you could have a virtual health coach that is deep learning about your relevant health metrics and providing you with customized dietary recommendations.&#x201D;</p>
</disp-quote>
<sec id="s13_5_1"><label>13.5.1</label>
<title>COVID-19 machine-learning diagnostics and prognostics</title>
<p>While it is not possible to review the vast number of papers on deep learning, it would be an important omission if we did not mention a most urgent issue of our times,<xref ref-type="fn" rid="fn325"><sup>325</sup></xref><fn id="fn325"><label>325</label><p>As of 2020.12.18, the COVID-19 pandemic was still raging across the entire United States.</p></fn> the COVID-19 (COronaVIrus Disease 2019) pandemic, and how deep learning could help in the diagnostics and prognostics of Covid-19.</p>
<p><bold>Some reviews of Covid-19 models and software.</bold> The following sweeping assertion was made in a 2021 MIT Technology Review article titled &#x201C;Hundreds of AI tools have been built to catch covid. None of them helped&#x201D; [<xref ref-type="bibr" rid="ref-334">334</xref>], based on two 2021 papers that reviewed and appraised the validity and usefulness of Covid-19 models for diagnostics (i.e., detecting Covid-19 infection) and for prognostics (i.e., forecasting the course of Covid-19 in patients) [<xref ref-type="bibr" rid="ref-335">335</xref>] and [<xref ref-type="bibr" rid="ref-336">336</xref>]:</p>
<disp-quote><p>&#x201C;The clear consensus was that AI tools had made little, if any, impact in the fight against covid.&#x201D;</p>
</disp-quote><p>A large collection of 37,421 titles (published and preprint reports) on Covid-19 models up to July 2020 was examined in [<xref ref-type="bibr" rid="ref-335">335</xref>], where only 169 studies describing 232 prediction models were selected based on CHARMS (CHecklist for critical Appraisal and data extraction for Systematic Reviews of prediction Modeling Studies) [<xref ref-type="bibr" rid="ref-337">337</xref>] for detailed analysis, with the risk of bias assessed using PROBAST (Prediction model Risk Of Bias ASsessment Tool) [<xref ref-type="bibr" rid="ref-338">338</xref>]. A follow-up study [<xref ref-type="bibr" rid="ref-336">336</xref>] examined 2,215 titles up to Oct 2020, using the same methodology as in [<xref ref-type="bibr" rid="ref-335">335</xref>] with the added requirement of &#x201C;sufficiently documented methodologies&#x201D;, to narrow down to 62 titles for review &#x201C;in most details&#x201D;. In the words of the lead developer of PROBAST, &#x201C;unfortunately&#x201D; journals outside the medical field were not included since it would be a &#x201C;surprise&#x201D; that &#x201C;the reporting and conduct of AI health models is better outside the medical literature&#x201D;.<xref ref-type="fn" rid="fn326"><sup>326</sup></xref><fn id="fn326"><label>326</label><p>Private communication with Karel (Carl) Moons on 2021 Oct 28. In other words, only medical journals included in PROBAST would report Covid-19 models that cannot be beaten by models reported in non-medical journals, such as in [<xref ref-type="bibr" rid="ref-333">333</xref>], which was indeed not &#x201C;fit for clinical use&#x201D; to use the same phrase in [<xref ref-type="bibr" rid="ref-334">334</xref>].</p></fn></p>

<fig id="fig-140">
<label>Figure 140</label>
<caption><title><italic>MIT COVID-19 diagnosis by cough recordings</italic>. Machine learning architecture. Audio Mel Frequency Cepstrum Coefficients (MFCC) as input. Each cough signal is split into 6 audio chunks, processed by the MFCC package, then passed through the Biomarker 1 to check on muscular degradation. The output of Biomarker 1 is input into each of the three Convolutional Neural Networks (CNNs), representing Biomarker 2 (Vocal cords), Biomarker 3 (Lungs &amp; Respiratory Tract), Biomarker 4 (Sentiment). The outputs of these CNNs are concatenated and &#x201C;pooled&#x201D; together to serve as (1) input for &#x201C;Competing Aggregator Models&#x201D; to produce a &#x201C;longitudinal saliency map&#x201D;, and as (2) input for a deep and dense network with ReLU activation, followed by a &#x201C;binary dense layer&#x201D; with sigmoid activation to produce Covid-19 diagnosis [<xref ref-type="bibr" rid="ref-333">333</xref>]. <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">(CC BY 4.0)</ext-link></title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-140.tif"/>
</fig>
<p><bold>Covid-19 diagnosis from cough recordings.</bold> MIT researchers developed a cough-test smartphone app that diagnoses Covid-19 from cough recordings [<xref ref-type="bibr" rid="ref-333">333</xref>], and claimed that their app achieved excellent results:<xref ref-type="fn" rid="fn327"><sup>327</sup></xref><fn id="fn327"><label>327</label><p>&#x201C;In medical diagnosis, test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (true negative rate).&#x201D; See &#x201C;Sensitivity and specificity&#x201D;, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Sensitivity_and_specificity&amp;oldid=1008200627">Wikipedia version 02:21, 22 February 2021</ext-link>. For the definition of &#x201C;AUC&#x201D; (Area Under the ROC Curve), with &#x201C;ROC&#x201D; abbreviating &#x201C;Receiver Operating Characteristic&#x201D;, see &#x201C;Classification: ROC Curve and AUC&#x201D;, in &#x201C;Machine Learning Crash Course&#x201D;, <ext-link ext-link-type="uri" xlink:href="https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc">Website</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20210129220610/https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc">Internet archive</ext-link>.</p></fn></p>
<disp-quote><p>&#x201C;When validated with subjects diagnosed using an official test, the model achieves COVID-19 sensitivity of 98.5% with a specificity of 94.2% (AUC: 0.97). For asymptomatic subjects it achieves sensitivity of 100% with a specificity of 83.2%.&#x201D; [<xref ref-type="bibr" rid="ref-333">333</xref>].</p>
</disp-quote><p>making one wonder why it had not been made available for use by everyone, since &#x201C;These inventions could help our coronavirus crisis now. But delays mean they may not be adopted until the worst of the pandemic is behind us&#x201D; [<xref ref-type="bibr" rid="ref-339">339</xref>].<xref ref-type="fn" rid="fn328"><sup>328</sup></xref><fn id="fn328"><label>328</label><p>One author of the present article (LVQ), more than one year after the preprint of [<xref ref-type="bibr" rid="ref-333">333</xref>], still spit into a tube for a Covid test instead of coughing into a phone.</p></fn></p>
<p>Unfortunately, we suspected that the model in [<xref ref-type="bibr" rid="ref-333">333</xref>] was also &#x201C;not fit for clinical use&#x201D; as described in [<xref ref-type="bibr" rid="ref-334">334</xref>], because it had not been put to use in the real world as of 2022 Jan 12 (we were still spitting saliva into a tube instead of coughing into our phones). In addition, despite being contacted three times regarding the lack of transparency in the description of the model in [<xref ref-type="bibr" rid="ref-333">333</xref>], in particular the &#x201C;Competing Aggregator Models&#x201D; in Figure <xref ref-type="fig" rid="fig-140">140</xref>, the authors of [<xref ref-type="bibr" rid="ref-333">333</xref>] did not respond to our repeated inquiries, confirming the criticism described in Section <xref ref-type="sec" rid="s14_7">14.7</xref> on &#x201C;Lack of transparency and irreproducibility of results&#x201D; of AI models.</p>
<p>Our suspicion was confirmed when we found the critical review paper [<xref ref-type="bibr" rid="ref-340">340</xref>], in which the pitfalls of the model in [<xref ref-type="bibr" rid="ref-333">333</xref>], among other cough audio models, were pointed out, with the single most important question being: Were the audio representations in these machine-learning models, even though correlated with Covid-19 in their respective datasets, true audio biomarkers originating from Covid-19? The seven grains of salt (pitfalls) listed in [<xref ref-type="bibr" rid="ref-340">340</xref>] were:</p>
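<p>For readers less familiar with the two rates quoted above, the following minimal sketch computes sensitivity and specificity from the four counts of a confusion matrix. The function names and the counts are hypothetical, chosen only for illustration; they are not data from [<xref ref-type="bibr" rid="ref-333">333</xref>].</p>
<code language="python" xml:space="preserve">
```python
def sensitivity(true_pos, false_neg):
    """True positive rate: fraction of subjects with the disease who test positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative rate: fraction of subjects without the disease who test negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, for illustration only (not taken from any cited study).
true_pos, false_neg = 90, 10   # 100 subjects with the disease
true_neg, false_pos = 80, 20   # 100 subjects without the disease

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```
</code>
<p>With these made-up counts, 10 of the 100 diseased subjects would be missed and 20 of the 100 healthy subjects would be flagged, which shows why both rates need to be read together.</p>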
<list list-type="simple">
<list-item><label>(1)</label><p>Machine-learning models did not detect Covid-19, but only distinguished between healthy people and sick people, a not so useful task.</p></list-item>
<list-item><label>(2)</label><p>Surrounding acoustic environment may introduce biases into the cough sound recordings, e.g., Covid-19 positive people tend to stay indoors, and Covid-19 negative people outdoors.</p></list-item>
<list-item><label>(3)</label><p>Participants providing coughs for the datasets may know their Covid-19 status, and that knowledge would affect their emotions, and hence the machine-learning models.</p></list-item>
<list-item><label>(4)</label><p>The machine-learning models can only be as accurate as the cough recording labels, which may not be valid since participants self-reported their Covid-19 status.</p></list-item>
<list-item><label>(5)</label><p>Most researchers, like the authors of [<xref ref-type="bibr" rid="ref-333">333</xref>], do not share code and datasets, or even information on their method as mentioned above; see also Section <xref ref-type="sec" rid="s14_7">14.7</xref> &#x201C;Lack of transparency&#x201D;.</p></list-item>
<list-item><label>(6)</label><p>The influence of factors such as comorbidity, ethnicity, geography, socio-economics, on Covid-19 is complex and unequal, and could introduce biases in the datasets.</p></list-item>
<list-item><label>(7)</label><p>Lack of population control (participant identity not recorded) led to non-disjoint training set, development set, and test set.</p></list-item></list>
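<p>Pitfall (7) can be avoided when participant identities are recorded, by splitting the data by participant rather than by recording. The following is a minimal sketch under that assumption; the function name, the split fractions, and the data are hypothetical and are not taken from [<xref ref-type="bibr" rid="ref-340">340</xref>].</p>
<code language="python">
```python
import random
from collections import defaultdict


def split_by_participant(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split (participant_id, recording) pairs so that no participant
    appears in more than one of the train / development / test sets."""
    by_participant = defaultdict(list)
    for participant_id, recording in records:
        by_participant[participant_id].append(recording)

    # Shuffle participants (not recordings) with a fixed seed.
    ids = sorted(by_participant)
    random.Random(seed).shuffle(ids)

    n_train = round(fractions[0] * len(ids))
    n_dev = round(fractions[1] * len(ids))
    groups = {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }
    return {
        name: [(pid, rec) for pid in members for rec in by_participant[pid]]
        for name, members in groups.items()
    }


# Hypothetical data: 20 participants with 3 cough recordings each.
records = [(f"p{i:02d}", f"cough_{i:02d}_{k}.wav")
           for i in range(20) for k in range(3)]
splits = split_by_participant(records)
ids_in = {name: {pid for pid, _ in rows} for name, rows in splits.items()}
print({name: len(members) for name, members in ids_in.items()})
```
</code>
<p>Because whole participants are assigned to a single set, all recordings of one person stay together, and the three sets are disjoint in terms of people as well as recordings.</p>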
<p><bold>Other Covid-19 machine-learning models.</bold> A comprehensive review of machine learning for Covid-19 diagnosis, covering medical-data collection, preprocessing of medical images, feature extraction, and classification, is provided in [<xref ref-type="bibr" rid="ref-341">341</xref>], which did not include methods based on cough sound recordings. Seven methods were reviewed in detail: (1) transfer learning, (2) ensemble learning, (3) unsupervised learning, (4) semi-supervised learning, (5) convolutional neural networks, (6) graph neural networks, and (7) explainable deep neural networks.</p>
<p>In [<xref ref-type="bibr" rid="ref-342">342</xref>], deep-learning methods together with transfer learning were reviewed for classification and detection of Covid-19 based on chest X-ray, computed-tomography (CT) images, and lung-ultrasound images. Also reviewed were machine-learning methods for selection of vaccine candidates, and natural-language-processing methods to analyze public sentiment during the pandemic.</p>
<p>For multi-disease (including Covid-19) prediction, methods based on (1) logistic regression, (2) machine learning, and in particular (3) deep learning were reviewed in [<xref ref-type="bibr" rid="ref-343">343</xref>], which also pointed out the difficulties encountered as a basis for future developments.</p>
<p>Information on the collection of genes, called genotype, related to Covid-19, was predicted by searching and scoring similarities between the seed genes (obtained from prior knowledge) and candidate genes (obtained from the biomedical literature) with the goal of establishing the molecular mechanism of Covid-19 [<xref ref-type="bibr" rid="ref-344">344</xref>].</p>
<p>In [<xref ref-type="bibr" rid="ref-345">345</xref>], the proteins associated with Covid-19 were predicted using ligand<xref ref-type="fn" rid="fn329"><sup>329</sup></xref><fn id="fn329"><label>329</label><p>A ligand is &#x201C;usually a molecule which produces a signal by binding to a site on a target protein,&#x201D; see &#x201C;Ligand (biochemistry)&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Ligand_(biochemistry)&amp;oldid=1059255240">version 11:08, 8 December 2021</ext-link>.</p></fn> design and molecular modeling.</p>
<p>In [<xref ref-type="bibr" rid="ref-346">346</xref>], after evaluating various computer-science techniques using the Fuzzy-Analytic Hierarchy Process integrated with the Technique for Order of Preference by Similarity to Ideal Solution, Blockchain was recommended as the most effective technique for healthcare workers to address Covid-19 problems in Saudi Arabia.</p>
<p>Other Covid-19 machine-learning models include the use of regression algorithms for real-time analysis of the Covid-19 pandemic [<xref ref-type="bibr" rid="ref-347">347</xref>], forecasting the number of infected people using the logistic growth curve and the Gompertz growth curve [<xref ref-type="bibr" rid="ref-348">348</xref>], and a generalization of the SEIR<xref ref-type="fn" rid="fn330"><sup>330</sup></xref><fn id="fn330"><label>330</label><p>SEIR = Susceptible, Exposed, Infectious, Recovered; see &#x201C;Compartmental models in epidemiology&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Compartmental_models_in_epidemiology&amp;oldid=1072808123">version 15:44, 19 February 2022</ext-link>.</p></fn> model and logistic regression for forecasting [<xref ref-type="bibr" rid="ref-349">349</xref>].</p>
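<p>For readers unfamiliar with the two growth curves mentioned above, the following sketch evaluates their standard textbook forms. The parametrization and the parameter values are assumptions for illustration only; they are not fitted to any data and may differ from the forms used in [<xref ref-type="bibr" rid="ref-348">348</xref>].</p>
<code language="python">
```python
import math


def logistic(t, capacity, rate, t_mid):
    """Logistic curve: capacity / (1 + exp(-rate * (t - t_mid)))."""
    return capacity / (1.0 + math.exp(-rate * (t - t_mid)))


def gompertz(t, capacity, displacement, rate):
    """Gompertz curve: capacity * exp(-displacement * exp(-rate * t))."""
    return capacity * math.exp(-displacement * math.exp(-rate * t))


# Hypothetical parameters (assumption), not fitted to any dataset.
for day in (0, 30, 60, 120):
    cases_logistic = logistic(day, capacity=1e5, rate=0.1, t_mid=60)
    cases_gompertz = gompertz(day, capacity=1e5, displacement=10.0, rate=0.05)
    print(day, round(cases_logistic), round(cases_gompertz))
```
</code>
<p>Both curves saturate at the same final size (here called capacity), but the logistic curve is symmetric about its midpoint, whereas the Gompertz curve approaches saturation more slowly than it departs from zero.</p>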
</sec>
<sec id="s13_5_2"><label>13.5.2</label>
<title>Additional applications of deep learning</title>
<p>The use of deep learning as one of several machine-learning techniques for Covid-19 diagnosis was reviewed in [<xref ref-type="bibr" rid="ref-341">341</xref>] [<xref ref-type="bibr" rid="ref-342">342</xref>] [<xref ref-type="bibr" rid="ref-343">343</xref>], as mentioned above.</p>
<p>By growing and pruning deep-learning neural networks (DNNs), optimal parameters, such as the number of hidden layers, the number of neurons, and the types of activation functions, were obtained for the diagnosis of Parkinson&#x2019;s disease, with 99.34% accuracy on test data, compared to previous DNNs using a specific or random number of hidden layers and neurons [<xref ref-type="bibr" rid="ref-350">350</xref>].</p>
<p>In [<xref ref-type="bibr" rid="ref-351">351</xref>], a deep residual network, with gridded interpolation and Swish activation function (see Section <xref ref-type="sec" rid="s13_3_3">13.3.3</xref>), was constructed to generate a single high-resolution image from many low-resolution images obtained from Fundus Fluorescein Angiography (FFA),<xref ref-type="fn" rid="fn331"><sup>331</sup></xref><fn id="fn331"><label>331</label><p>Fundus is &#x201C;the interior surface of the eye opposite the lens and includes the retina, optic disc, macula, fovea, and posterior pole&#x201D; (Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Fundus_(eye)&amp;oldid=934540306">version 02:49, 7 January 2020</ext-link>). Fluorescein is an organic compound and fluorescent dye (Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Fluorescein&amp;oldid=1064139571">version 19:51, 6 January 2022</ext-link>). Angiography (angio- &#x201C;blood vessel&#x201D; + graphy &#x201C;write, record&#x201D;, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/w/index.php?title=Angiography&amp;oldid=1069444815">version 10:19, 2 February 2022</ext-link>) is a medical procedure to visualize the flow of blood (or other biological fluid) by injecting a dye and by using a special camera.</p></fn> resulting in &#x201C;superior performance metrics and computational time.&#x201D;</p>
<p>Going beyond the use of Proper Orthogonal Decomposition and Generalized Falk Method [<xref ref-type="bibr" rid="ref-352">352</xref>], a hierarchical deep-learning neural network was proposed in [<xref ref-type="bibr" rid="ref-353">353</xref>] to be used with the Proper Generalized Decomposition as a model-order-reduction method applied to finite element models.</p>
<p>To develop a high-precision model to forecast wind speed and wind power, which depend on the conditions of the nearby &#x201C;atmospheric pressure, temperature, roughness, and obstacles&#x201D;, the authors of [<xref ref-type="bibr" rid="ref-354">354</xref>] applied &#x201C;deep learning, reinforcement learning and transfer learning.&#x201D; The challenges in this area are the randomness, the instantaneity, and the seasonal characteristics of wind and the atmosphere.</p>
<p>Self-driving cars must deal with a large variety of real scenarios and of real behaviors, which deep-learning perception-action models should learn in order to become robust. Because the available data are limited, it was proposed in [<xref ref-type="bibr" rid="ref-355">355</xref>] to use a new image style transfer method to generate more variety in the data by modifying texture, contrast ratio, and image color, and then to extend the data to scenarios that were unobserved before.</p>
<p>Other applications of deep learning include a real-time maskless-face detector using deep residual networks [<xref ref-type="bibr" rid="ref-356">356</xref>], topology optimization with embedded physical law and physical constraints [<xref ref-type="bibr" rid="ref-357">357</xref>], prediction of stress-strain relations in granular materials from triaxial test results [<xref ref-type="bibr" rid="ref-358">358</xref>], surrogate model for flight-load analysis [<xref ref-type="bibr" rid="ref-359">359</xref>], classification of domestic refuse in medical institutions based on transfer learning and convolutional neural network [<xref ref-type="bibr" rid="ref-360">360</xref>], convolutional neural network for arrhythmia diagnosis [<xref ref-type="bibr" rid="ref-361">361</xref>], e-commerce dynamic pricing by deep reinforcement learning [<xref ref-type="bibr" rid="ref-362">362</xref>], network intrusion detection [<xref ref-type="bibr" rid="ref-363">363</xref>], road pavement distress detection for smart maintenance [<xref ref-type="bibr" rid="ref-364">364</xref>], traffic flow statistics [<xref ref-type="bibr" rid="ref-365">365</xref>], multi-view gait recognition using deep CNN and channel attention mechanism [<xref ref-type="bibr" rid="ref-366">366</xref>], mortality risk assessment of ICU patients [<xref ref-type="bibr" rid="ref-367">367</xref>], stereo matching method based on space-aware network model to reduce the limitation of GPU RAM [<xref ref-type="bibr" rid="ref-368">368</xref>], air quality forecasting in Internet of Things [<xref ref-type="bibr" rid="ref-369">369</xref>], analysis of cardiac disease abnormal ECG signals [<xref ref-type="bibr" rid="ref-370">370</xref>], detection of mechanical parts (nuts, bolts, gaskets, etc.) 
by machine vision [<xref ref-type="bibr" rid="ref-371">371</xref>], asphalt road crack detection [<xref ref-type="bibr" rid="ref-372">372</xref>], steel commodity selection using bidirectional encoder representations from transformers (BERT) [<xref ref-type="bibr" rid="ref-373">373</xref>], short-term traffic flow prediction using an LSTM-XGBoost combination model [<xref ref-type="bibr" rid="ref-374">374</xref>], and emotion analysis based on multi-channel CNN in social networks [<xref ref-type="bibr" rid="ref-375">375</xref>].</p>

<fig id="fig-141">
<label>Figure 141</label>
<caption><title><italic>Tesla Full-Self-Driving (FSD) controversy</italic> (Section <xref ref-type="sec" rid="s14_1">14.1</xref>). <italic>Left</italic>: Tesla in FSD mode repeatedly hit a child-size mannequin in safety tests by The Dawn Project, a software competitor to Tesla, 2022.08.09 [<xref ref-type="bibr" rid="ref-376">376</xref>] [<xref ref-type="bibr" rid="ref-377">377</xref>]. <italic>Right</italic>: Tesla in FSD mode went around a child-size mannequin at 15 mph in a residential area, 2022.08.14 [<xref ref-type="bibr" rid="ref-378">378</xref>] [<xref ref-type="bibr" rid="ref-379">379</xref>]. Would a prudent driver stop completely, waiting for the kid to move out of the road, before proceeding forward? The driver, a Tesla investor, did not use his own child, indicating that his maneuver was not safe.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-141.tif"/>
</fig>
</sec></sec></sec>
<sec id="s14"><label>14</label>
<title>Closure: Limitations and danger of AI</title>
<p>A goal of the present review paper is to bring first-time learners from the beginning level to as close as possible to the research frontier in deep learning, with particular connection to, and application in, computational mechanics.</p>
<p>As concluding remarks, we collect here some known limitations and dangers of AI in general, and deep learning in particular. Hinton himself pointed out a limitation in the generalization of deep learning [<xref ref-type="bibr" rid="ref-383">383</xref>]:</p>
<disp-quote><p>&#x201C;If a neural network is trained on images that show a coffee cup only from a side, for example, it is unlikely to recognize a coffee cup turned upside down.&#x201D;</p>
</disp-quote>
<sec id="s14_1"><label>14.1</label>
<title>Driverless cars, crewless ships, &#x201C;not any time soon&#x201D;</title>
<p>In 2016, the former U.S. secretary of transportation, Anthony Foxx, described a rosy future just five years down the road: &#x201C;By 2021, we will see autonomous vehicles in operation across the country in ways that we [only] imagine today&#x201D; [<xref ref-type="bibr" rid="ref-384">384</xref>].</p>
<p>On 2022.06.09, UPI reported that &#x201C;Automaker Hyundai and South Korean officials launched a trial service of self-driving taxis in the busy Seoul neighborhood of Gangnam,&#x201D; an event described as &#x201C;the latest step forward in the country&#x2019;s efforts to make autonomous vehicles an everyday reality. The new service, called RoboRide, features Hyundai Ioniq 5 electric cars equipped with Level 4 autonomous driving capabilities. The technology allows the taxis to move independently in real-life traffic without the need for human control, although a safety driver will remain in the car&#x201D; [<xref ref-type="bibr" rid="ref-385">385</xref>]. According to Hyundai, the safety driver &#x201C;only intervenes under limited conditions,&#x201D; which were not explicitly specified to the public, whereas the car itself would &#x201C;perceive, make decisions and control its own driving status.&#x201D;</p>

<fig id="fig-142">
<label>Figure 142</label>
<caption><title><italic>Tesla Full-Self-Driving (FSD) controversy</italic> (Section <xref ref-type="sec" rid="s14_1">14.1</xref>). The Tesla was about to run down the child-size mannequin at 23 mph, hitting it at 24 mph. The driver did not hold the steering wheel, but only kept his hands close to it for safety, and did not put his foot on the accelerator. There were no cones on either side of the road, and there was room to go around the mannequin. The weather was clear and sunny. The mannequin wore a bright safety jacket. Visibility was excellent, 2022.08.15 [<xref ref-type="bibr" rid="ref-380">380</xref>] [<xref ref-type="bibr" rid="ref-381">381</xref>].</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-142.tif"/>
</fig>
<p>But what is &#x201C;Level 4 autonomous driving&#x201D;? Let&#x2019;s look at the startup autonomous-driving company Waymo. Their Level 4 consists of &#x201C;mapping the territory in a granular fashion (including lane markers, traffic signs and lights, curbs, and crosswalks). The solution incorporates both GPS signals and real-time sensor data to always determine the vehicle&#x2019;s exact location. Further, the system relies on more than 20 million miles of real-world driving and more than 20 billion miles in simulation, to allow the Waymo Driver to anticipate what other road users, pedestrians, or other objects might do&#x201D; [<xref ref-type="bibr" rid="ref-386">386</xref>].</p>
<p>Yet Level 4 is still far from Level 5, for which &#x201C;vehicles are fully automated with no need for the driver to do anything but set the destination and ride along. They can drive themselves anywhere under any conditions, safely&#x201D; [<xref ref-type="bibr" rid="ref-387">387</xref>], and which would still be many years away [<xref ref-type="bibr" rid="ref-386">386</xref>].</p>
<p>Indeed, exactly two months after Hyundai&#x2019;s announcement of their Level 4 test pilot program, on 2022.08.09, <italic>The Guardian</italic> reported that in a series of safety tests, a &#x201C;professional test driver using Tesla&#x2019;s Full Self-Driving mode repeatedly hit a child-sized mannequin in its path&#x201D; [<xref ref-type="bibr" rid="ref-377">377</xref>]; Figure <xref ref-type="fig" rid="fig-141">141</xref>, left. &#x201C;It&#x2019;s a lethal threat to all Americans, putting children at great risk in communities across the country,&#x201D; warned The Dawn Project&#x2019;s founder, Dan O&#x2019;Dowd, who described the test results as &#x201C;deeply disturbing,&#x201D; as the vehicle tended to &#x201C;mow down children at crossroads,&#x201D; and who argued for prohibiting Tesla vehicles from running in the street until Tesla&#x2019;s self-driving software could be proven safe.</p>
<p>The Dawn Project test results were contested by a Tesla investor, who posted a video on 2022.08.14 to prove that the Tesla Full-Self-Driving (FSD) system worked as advertised (Figure <xref ref-type="fig" rid="fig-141">141</xref>, right). The next day, 2022.08.15, Dan O&#x2019;Dowd posted a video proving that the Tesla under FSD mode ran over a child-size mannequin at 24 mph in clear weather, with excellent visibility, no cones on either side of the Tesla, and without the driver pressing his foot on the accelerator (Figure <xref ref-type="fig" rid="fig-142">142</xref>).</p>

<fig id="fig-143">
<label>Figure 143</label>
<caption><title><italic>Tesla crash</italic> (Section <xref ref-type="sec" rid="s14_1">14.1</xref>). July 2020. <italic>Left</italic>: &#x201C;Less than a half-second after [the Tesla driver] flipped on her turn signal, Autopilot started moving the car into the right lane and gradually slowed, video and sensor data showed.&#x201D; <italic>Right:</italic> &#x201C;Halfway through, the Tesla sensed an obstruction&#x2014;possibly a truck stopped on the side of the road&#x2014;and paused its lane change. The car then veered left and decelerated rapidly&#x201D; [<xref ref-type="bibr" rid="ref-382">382</xref>]. See also Figures <xref ref-type="fig" rid="fig-144">144</xref>, <xref ref-type="fig" rid="fig-145">145</xref>, <xref ref-type="fig" rid="fig-146">146</xref>. (Data and video provided by QuantivRisk.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-143.tif"/>
</fig>
<p>&#x201C;In June [2022], the National Highway Traffic Safety Administration (NHTSA), said it was expanding an investigation into 830,000 Tesla cars across all four current model lines. The expansion came after analysis of a number of accidents revealed patterns in the car&#x2019;s performance and driver behavior&#x201D; [<xref ref-type="bibr" rid="ref-377">377</xref>]. &#x201C;Since 2016, the agency has investigated 30 crashes involving Teslas equipped with automated driving systems, 19 of them fatal. NHTSA&#x2019;s Office of Defects Investigation is also looking at the company&#x2019;s autopilot technology in at least 11 crashes where Teslas hit emergency vehicles.&#x201D;</p>
<p>In 2019, it was reported that several car executives thought that driverless cars were still several years in the future because of the difficulty in anticipating human behavior [<xref ref-type="bibr" rid="ref-388">388</xref>]. The progress of Hyundai&#x2019;s driverless taxis has not solved the challenge of dealing with human behavior, as there was still a need for a &#x201C;safety driver.&#x201D;</p>
<p>&#x201C;On [2022] May 6, Lyft, the ride-sharing service that competes with Uber sold its Level 5 division, an autonomous-vehicle unit, to Woven Planet, a Toyota subsidiary. After four years of research and development, the company seems to realize that autonomous driving is a tough nut to crack&#x2014;much tougher than the team had anticipated.</p>
<p>&#x201C;Uber came to the same conclusion, but even earlier, in December. The company sold Advanced Technologies Group, its self-driving unit, to Aurora Innovation, citing high costs and more than 30 crashes, culminating in a fatality as the reason for cutting its losses.</p>
<p>&#x201C;Finally, several smaller companies, including Zoox, a robo-taxi company; Ike, an autonomous-trucking startup; and Voyage, a self-driving startup; have also passed the torch to companies with bigger budgets&#x201D; [<xref ref-type="bibr" rid="ref-384">384</xref>].</p>
<p>&#x201C;Those startups, like many in the industry, have underestimated the sheer difficulty of &#x201C;leveling up&#x201D; vehicle autonomy to the fabled Level 5 (full driving automation, no human required)&#x201D; [<xref ref-type="bibr" rid="ref-384">384</xref>].</p>
<p>On top of the difficulty in addressing human behavior, there were other problems, perhaps in principle less challenging, so we thought, as reported in [<xref ref-type="bibr" rid="ref-386">386</xref>]: &#x201C;widespread adoption of autonomous driving is still years away from becoming a reality, largely due to the challenges involved with the development of accurate sensors and cameras, as well as the refinement of algorithms that act upon the data captured by these sensors.</p>

<fig id="fig-144">
<label>Figure 144</label>
<caption><title><italic>Tesla crash</italic> (Section <xref ref-type="sec" rid="s14_1">14.1</xref>). July 2020. &#x201C;Less than a second after the Tesla has slowed to roughly 55 m.p.h. [<italic>Left</italic>], its rear camera shows a car rapidly approaching [<italic>Right</italic>]&#x201D; [<xref ref-type="bibr" rid="ref-382">382</xref>]. There were no moving cars in either lane in front of the Tesla for a long distance ahead (perhaps a quarter of a mile). See also Figures <xref ref-type="fig" rid="fig-143">143</xref>, <xref ref-type="fig" rid="fig-145">145</xref>, <xref ref-type="fig" rid="fig-146">146</xref>. (Data and video provided by QuantivRisk.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-144.tif"/>
</fig>
<p>&#x201C;This process is extremely data-intensive, given the large variety of potential objects that could be encountered, as well as the near-infinite ways objects can move or react to stimuli (for example, road signs may not be accurately identified due to lighting conditions, glare, or shadows, and animals and people do not all respond the same way when a car is hurtling toward them).</p>
<p>&#x201C;Algorithms in use still have difficulty identifying objects in real-world scenarios; in one accident involving a Tesla Model X, the vehicle&#x2019;s sensing cameras failed to identify a truck&#x2019;s white side against a brightly lit sky.&#x201D;</p>
<p>In addition to Figure <xref ref-type="fig" rid="fig-141">141</xref>, another example was a Tesla crash in July 2020 in clear, sunny weather, with few clouds, as shown in Figures <xref ref-type="fig" rid="fig-143">143</xref>, <xref ref-type="fig" rid="fig-144">144</xref>, <xref ref-type="fig" rid="fig-145">145</xref>, <xref ref-type="fig" rid="fig-146">146</xref>. The self-driving system could not detect that a static truck was parked on the side of a highway, and due to the forward and lane-changing motion of the Tesla, the software could have thought that it was running into the truck, and veered left while rapidly decelerating to avoid a collision with the truck. As a result, the Tesla was rear-ended by a fast-approaching car from behind on its left side [<xref ref-type="bibr" rid="ref-382">382</xref>].</p>
<p>&#x201C;Pony.ai is the latest autonomous car company to make headlines for the wrong reasons. It has just lost its permit to test its fleet of autonomous vehicles in California over concerns about the driving record of the safety drivers it employs. It&#x2019;s a big blow for the company, and highlights the interesting spot the autonomous car industry is in right now. After a few years of very bad publicity, a number of companies have made real progress in getting self-driving cars on the road&#x201D; [<xref ref-type="bibr" rid="ref-389">389</xref>].</p>
<p>The 2022 article &#x201C;&#x2018;I&#x2019;m the Operator&#x2019;: The Aftermath of a Self-Driving Tragedy&#x201D; [<xref ref-type="bibr" rid="ref-390">390</xref>] described these &#x201C;few years of very bad publicity&#x201D; in stunning, tragic detail about an Uber autonomous-vehicle operator, Rafaela Vasquez, who did not take over the control of the vehicle in time, and killed a jaywalking pedestrian.</p>
<p>The classification software of the Uber autonomous driving system could not recognize the pedestrian, but vacillated among &#x201C;vehicle&#x201D;, &#x201C;other&#x201D;, and &#x201C;bicycle&#x201D; [<xref ref-type="bibr" rid="ref-390">390</xref>].</p>
<p>&#x201C;At 2.6 seconds from the object, the system identified it as &#x2018;bicycle.&#x2019; At 1.5 seconds, it switched back to considering it &#x2018;other.&#x2019; Then back to &#x2018;bicycle&#x2019; again. The system generated a plan to try to steer around whatever it was, but decided it couldn&#x2019;t. Then, at 0.2 seconds to impact, the car let out a sound to alert Vasquez that the vehicle was going to slow down. At two-hundredths of a second before impact, traveling at 39 mph, Vasquez grabbed the steering wheel, which wrested the car out of autonomy and into manual mode. It was too late. The smashed bike scraped a 25-foot wake on the pavement. A person lay crumpled in the road&#x201D; [<xref ref-type="bibr" rid="ref-390">390</xref>].</p>

<fig id="fig-145">
<label>Figure 145</label>
<caption><title><italic>Tesla crash</italic> (Section <xref ref-type="sec" rid="s14_1">14.1</xref>). July 2020. The fast-approaching blue car rear-ended the Tesla, indenting its own front bumper, with flying shards of a broken glass (or clear plastic) cover captured by the Tesla rear camera [<xref ref-type="bibr" rid="ref-382">382</xref>]. See also Figures <xref ref-type="fig" rid="fig-143">143</xref>, <xref ref-type="fig" rid="fig-144">144</xref>, <xref ref-type="fig" rid="fig-146">146</xref>. (Data and video provided by QuantivRisk.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-145.tif"/>
</fig>
<p>The operator training program manager said &#x201C;I felt shame when I heard of a lone frontline employee has been singled out to be charged of negligent homicide with a dangerous instrument. We owed Rafaela better oversight and support. We also put her in a tough position.&#x201D; Another program manager said &#x201C;You can&#x2019;t put the blame on just that one person. I mean, it&#x2019;s absurd. Uber had to know this would happen. We get distracted in <italic>regular</italic> driving. It&#x2019;s not like somebody got into their car and decided to run into someone. They were working within a framework. And that framework created the conditions that allowed that to happen.&#x201D; [<xref ref-type="bibr" rid="ref-390">390</xref>].</p>
<p>After the above-mentioned fatality caused by an Uber autonomous car with a single operator in it, &#x201C;many companies temporarily took their cars off the road, and after it was revealed that only one technician was inside the Uber car, most companies resolved to keep two people in their test vehicles at all times&#x201D; [<xref ref-type="bibr" rid="ref-391">391</xref>]. Having two operators in a car would help to avoid accidents, but the pandemic social-distancing rule often prevented such an arrangement.</p>
<p>&#x201C;Many self-driving car companies have no revenue, and the operating costs are unusually high. Autonomous vehicle start-ups spend &#x0024;1.6 million a month on average&#x2014;four times the rate at financial tech or health care companies&#x201D; [<xref ref-type="bibr" rid="ref-391">391</xref>].</p>
<p>&#x201C;Companies like Uber and Lyft, worried about blowing through their cash in pursuit of autonomous technology, have tapped out. Only the deepest-pocketed outfits like Waymo, which is a subsidiary of Google&#x2019;s parent company, Alphabet; auto giants; and a handful of start-ups are managing to stay in the game.</p>
<p>&#x201C;Late last month, Lyft sold its autonomous vehicle unit to a Toyota subsidiary, Woven Planet, in a deal valued at &#x0024;550 million. Uber offloaded its autonomous vehicle unit to another competitor in December. And three prominent self-driving start-ups have sold themselves to companies with much bigger budgets over the past year&#x201D; [<xref ref-type="bibr" rid="ref-392">392</xref>].</p>

<fig id="fig-146">
<label>Figure 146</label>
<caption><title><italic>Tesla crash</italic> (Section <xref ref-type="sec" rid="s14_1">14.1</xref>). After hitting the Tesla, the blue car &#x201C;spun across the highway [<italic>Left</italic>] and onto the far shoulder [<italic>Right</italic>],&#x201D; as another car was approaching in the right lane (left in photo), but still at a safe distance so as not to hit it [<xref ref-type="bibr" rid="ref-382">382</xref>]. See also Figures <xref ref-type="fig" rid="fig-143">143</xref>, <xref ref-type="fig" rid="fig-144">144</xref>, <xref ref-type="fig" rid="fig-145">145</xref>. (Data and video provided by QuantivRisk.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-146.tif"/>
</fig>
<p>Similar problems exist with building autonomous boats to ply the oceans without a need for a crew on board [<xref ref-type="bibr" rid="ref-393">393</xref>]:</p>
<disp-quote><p>&#x201C;When compared with autonomous cars, ships have the advantage of not having to make split-second decisions in order to avoid catastrophe. The open ocean is also free of jaywalking pedestrians, stoplights and lane boundaries. That said, robot ships share some of the problems that have bedeviled autonomous vehicles on land, namely, that they&#x2019;re bad at anticipating what humans will do, and have limited ability to communicate with them.</p>
</disp-quote><p>Shipping is a dangerous profession, as there were some 41 large ships lost at sea due to fires, rogue waves, or other accidents, in 2019 alone. But before an autonomous ship can reach the ocean, it must get out of port, and that remains a technical hurdle not yet overcome:</p>
<disp-quote><p>&#x201C; &#x2019;Technically, it&#x2019;s not possible yet to make an autonomous ship that operates safely and efficiently in crowded areas and in port areas,&#x2019; says Rudy Negenborn, a professor at TU Delft who researches and designs systems for autonomous shipping.</p>
<p>Makers of autonomous ships handle these problems by giving humans remote control. But what happens when the connection is lost? Satisfactory solutions to these problems have yet to arrive, adds Dr. Negenborn.&#x201D;</p>
</disp-quote><p>The onboard deep-learning computer vision system was trained to recognize &#x201C;kayaks, canoes, Sea-Doos&#x201D;, but a person standing on a paddle board would look like someone walking on water to the system [<xref ref-type="bibr" rid="ref-393">393</xref>]. See also Figures <xref ref-type="fig" rid="fig-143">143</xref>, <xref ref-type="fig" rid="fig-144">144</xref>, <xref ref-type="fig" rid="fig-145">145</xref>, <xref ref-type="fig" rid="fig-146">146</xref> on the failure of the Tesla computer vision system in detecting a parked truck on the side of a highway.</p>
<p>Beyond the possible loss of connection in a remotely controlled ship, mechanical failures do occur, as happened to the Mayflower autonomous ship shown in Figure <xref ref-type="fig" rid="fig-147">147</xref> [<xref ref-type="bibr" rid="ref-394">394</xref>]. Measures would have to be taken when mechanical failure happens to a crewless ship in the middle of a vast ocean.</p>
<p>See also the interview of S.J. Russell in [<xref ref-type="bibr" rid="ref-79">79</xref>] on the need to develop hybrid systems that combine classical AI with deep learning, which has limitations even though it is good at classification and perception,<xref ref-type="fn" rid="fn332"><sup>332</sup></xref><fn id="fn332"><label>332</label><p>S.J. Russell also appeared in the <xref ref-type="sec" rid="s14_8">video</xref> &#x201C;AI is making it easier...&#x201D; mentioned at the end of this closure section.</p></fn> and Section <xref ref-type="sec" rid="s14_3">14.3</xref> on the barrier of meaning in AI.</p>
<fig id="fig-147">
<label>Figure 147</label>
<caption><title><italic>Mayflower autonomous ship</italic> (Section <xref ref-type="sec" rid="s14_1">14.1</xref>) sailing from Plymouth, UK, planning to arrive at Plymouth, MA, U.S., like the original Mayflower 400 years ago, but instead arriving at Halifax, Nova Scotia, Canada, on 2022 Jun 05, due to mechanical problems [<xref ref-type="bibr" rid="ref-394">394</xref>]. (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by-sa/4.0/deed.en.tif">CC BY-SA 4.0</ext-link>, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://commons.wikimedia.org/w/index.php?title=File:Mayflower_Autonomous_Ship_inside_Plymouth_Sound.jpg&amp;oldid=675176688.tif">version 16:43, 17 July 2022</ext-link>.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-147.tif"/>
</fig>
</sec>
<sec id="s14_2"><label>14.2</label>
<title>Lack of understanding on why deep learning worked</title>
<p>Such lack of understanding is described in <italic>The Guardian</italic>&#x2019;s editorial on New Year&#x2019;s Day 2019 [<xref ref-type="bibr" rid="ref-67">67</xref>] as follows:</p>
<disp-quote><p>&#x201C;Compared with conventional computer programs, [AI that teaches itself] acts for reasons incomprehensible to the outside world. It can be trained, as a parrot can, by rewarding the desired behaviour; in fact, this describes the whole of its learning process. But it can&#x2019;t be consciously designed in all its details, in the way that a passenger jet can be. If an airliner crashes, it is in theory possible to reconstruct all the little steps that led to the catastrophe and to understand why each one happened, and how each led to the next. Conventional computer programs can be debugged that way. This is true even when they interact in baroquely complicated ways. But neural networks, the kind of software used in almost everything we call AI, can&#x2019;t even in principle be debugged that way. We know they work, and can by training encourage them to work better. But in their natural state it is quite impossible to reconstruct the process by which they reach their (largely correct) conclusions.&#x201D;</p>
</disp-quote>
<p>The 2021 breakthrough in computer science, as declared by <italic>Quanta Magazine</italic> [<xref ref-type="bibr" rid="ref-233">233</xref>], was the discovery of the connection between shallow networks with infinite width (Figure <xref ref-type="fig" rid="fig-148">148</xref>) and kernel machines (or methods) as a first step in trying to understand how deep-learning networks work; see Section <xref ref-type="sec" rid="s8">8</xref> on &#x201C;Kernel machines&#x201D; and Footnote <xref ref-type="fn" rid="fn31">31</xref>.</p>
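<p>As a minimal numerical sketch of the infinite-width limit, and not the construction used in the cited works, consider a one-hidden-layer tanh network whose weights are drawn independently from a standard normal distribution and whose output is scaled by the inverse square root of the width. The function names, widths, and sample counts below are illustrative assumptions. By the central limit theorem, the output at a fixed input approaches a Gaussian as the width grows, which the excess kurtosis (zero for a Gaussian) makes visible:</p>
<code language="python">
```python
import math
import random


def random_network_output(x, width, rng):
    """Output of a random one-hidden-layer tanh network at input x.

    Assumption for illustration: all weights are i.i.d. N(0, 1) and the
    output is scaled by 1 / sqrt(width).
    """
    total = 0.0
    for _ in range(width):
        w = rng.gauss(0.0, 1.0)  # input-to-hidden weight
        v = rng.gauss(0.0, 1.0)  # hidden-to-output weight
        total += v * math.tanh(w * x)
    return total / math.sqrt(width)


def excess_kurtosis(samples):
    """Sample excess kurtosis; close to zero for Gaussian data."""
    n = len(samples)
    m = sum(samples) / n
    var = sum((s - m) ** 2 for s in samples) / n
    fourth = sum((s - m) ** 4 for s in samples) / n
    return fourth / var ** 2 - 3.0


rng = random.Random(0)  # fixed seed so that the run is repeatable
# Each sample is the output of a freshly drawn random network at x = 1.
narrow = [random_network_output(1.0, 1, rng) for _ in range(4000)]
wide = [random_network_output(1.0, 200, rng) for _ in range(4000)]

kurt_narrow = excess_kurtosis(narrow)
kurt_wide = excess_kurtosis(wide)
print("excess kurtosis, width 1:  ", round(kurt_narrow, 3))
print("excess kurtosis, width 200:", round(kurt_wide, 3))
```
</code>
<p>With these settings, the excess kurtosis of the wide network should be much closer to zero than that of the single-neuron network, consistent with the Gaussian limit sketched in Figure <xref ref-type="fig" rid="fig-148">148</xref>.</p>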
</sec>
<sec id="s14_3"><label>14.3</label>
<title>Barrier of meaning</title>
<p>Deep learning could not think like humans do, and could be easily fooled as reported in [<xref ref-type="bibr" rid="ref-395">395</xref>]:</p>
<disp-quote><p>&#x201C; Machine learning algorithms don&#x2019;t yet understand things the way humans do&#x2014;with sometimes disastrous consequences.</p>
<p>Even more worrisome are recent demonstrations of the vulnerability of A.I. systems to so-called adversarial examples. In these, a malevolent hacker can make specific changes to images, sound waves or text documents that while imperceptible or irrelevant to humans will cause a program to make potentially catastrophic errors.</p>

<fig id="fig-148">
<label>Figure 148</label>
<caption><title><italic>Network with infinite width</italic> (left) and Gaussian distribution (right) (Sections <xref ref-type="sec" rid="s6_1">6.1</xref>, <xref ref-type="sec" rid="s14_2">14.2</xref>). &#x201C;A number of recent results have shown that DNNs that are allowed to become infinitely wide converge to another, simpler, class of models called Gaussian processes. In this limit, complicated phenomena (like Bayesian inference or gradient descent dynamics of a convolutional neural network) boil down to simple linear algebra equations. Insights from these infinitely wide networks frequently carry over to their finite counterparts. As such, infinite-width networks can be used as a lens to study deep learning, but also as useful models in their own right&#x201D; [<xref ref-type="bibr" rid="ref-279">279</xref>] [<xref ref-type="bibr" rid="ref-231">231</xref>]. See Figures <xref ref-type="fig" rid="fig-60">60</xref> and <xref ref-type="fig" rid="fig-61">61</xref> for the motivation for networks with infinite width. (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by-sa/4.0/deed.en.tif">CC BY-SA 4.0</ext-link>, Wikipedia, <ext-link ext-link-type="uri" xlink:href="https://commons.wikimedia.org/w/index.php?title=File:Infinitely_wide_neural_network.webm&amp;oldid=665988221.tif">version 03:51, 18 June 2022</ext-link>.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-148.tif"/>
</fig>
<p>The possibility of such attacks has been demonstrated in nearly every application domain of A.I., including computer vision, medical image processing, speech recognition and language processing. Numerous studies have demonstrated the ease with which hackers could, in principle, fool face- and object-recognition systems with specific minuscule changes to images, put inconspicuous stickers on a stop sign to make a self-driving car&#x2019;s vision system mistake it for a yield sign or modify an audio signal so that it sounds like background music to a human but instructs a Siri or Alexa system to perform a silent command.</p>
<p>These potential vulnerabilities illustrate the ways in which current progress in A.I. is stymied by the barrier of meaning. Anyone who works with A.I. systems knows that behind the facade of humanlike visual abilities, linguistic fluency and game-playing prowess, these programs do not&#x2014;in any humanlike way&#x2014;understand the inputs they process or the outputs they produce. The lack of such understanding renders these programs susceptible to unexpected errors and undetectable attacks.</p>
<p>As the A.I. researcher Pedro Domingos noted in his book <italic>The Master Algorithm</italic>, &#x2018;People worry that computers will get too smart and take over the world, but the real problem is that they&#x2019;re too stupid and they&#x2019;ve already taken over the world.&#x2019;</p>
</disp-quote><p>Such a barrier of meaning is also a barrier for AI to tackle human controversies; see Section <xref ref-type="sec" rid="s14_5">14.5</xref>. See also Section <xref ref-type="sec" rid="s14_1">14.1</xref> on driverless cars not coming any time soon, which is related to the above barrier of meaning.</p>
</sec>
<sec id="s14_4"><label>14.4</label>
<title>Threat to democracy and privacy</title>
<p>On New Year&#x2019;s Day 2019, <italic>The Guardian</italic> [<xref ref-type="bibr" rid="ref-67">67</xref>] not only reported the most recent breakthrough in AI research, the development of AlphaZero, software with superhuman performance in several &#x201C;immensely complex&#x201D; games such as Go (see Section <xref ref-type="sec" rid="s13_5">13.5</xref> on resurgence of AI and current state), but also reported another breakthrough as a more ominous warning on a &#x201C;Power struggle&#x201D; to preserve liberal democracies against authoritarian governments and criminals:</p>
<disp-quote><p>&#x201C;The second great development of the last year makes bad outcomes much more likely. This is the much wider availability of powerful software and hardware. Although vast quantities of data and computing power are needed to train most neural nets, once trained a net can run on very cheap and simple hardware. This is often called the democratisation of technology but it is really the anarchisation of it. Democracies have means of enforcing decisions; anarchies have no means even of making them. The spread of these powers to authoritarian governments on the one hand and criminal networks on the other poses a double challenge to liberal democracies. Technology grants us new and almost unimaginable powers but at the same time it takes away some powers, and perhaps some understanding too, that we thought we would always possess.&#x201D;</p>
</disp-quote><p>Nearly three years later, a report of a national poll of 2,200 adults in the U.S., released on 2021.11.15, indicated that three in four adults were concerned about the loss of privacy, &#x201C;loss of trust in elections (57%), in threats to democracy (52%), and in loss of trust in institutions (56%). Additionally, 58% of respondents say it has contributed to the spread of misinformation&#x201D; [<xref ref-type="bibr" rid="ref-396">396</xref>].</p>

<fig id="fig-149">
<label>Figure 149</label>
<caption><title><italic>Deepfake images</italic> (Section <xref ref-type="sec" rid="s14_4_1">14.4.1</xref>). AI-generated portraits using Generative Adversarial Network (GAN) models. See also [<xref ref-type="bibr" rid="ref-397">397</xref>] [<xref ref-type="bibr" rid="ref-398">398</xref>], Chap. 8, &#x201C;GAN Fingerprints in Face Image Synthesis.&#x201D; (Images from <ext-link ext-link-type="uri" xlink:href="https://thispersondoesnotexist.com/">&#x2018;This Person Does Not Exist&#x2019;</ext-link> site.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-149.tif"/>
</fig>
<sec id="s14_4_1"><label>14.4.1</label>
<title>Deepfakes</title>
<p>AI software available online that helps create videos showing someone saying or doing things that the person never said or did represents a clear danger to democracy, since such deepfake videos could affect the outcome of an election, among other misdeeds, and pose a risk to national security. Advances in machine learning have made deepfakes &#x201C;ever more realistic and increasingly resistant to detection&#x201D; [<xref ref-type="bibr" rid="ref-399">399</xref>]; see Figure <xref ref-type="fig" rid="fig-149">149</xref>. The authors of [<xref ref-type="bibr" rid="ref-400">400</xref>] concurred:</p>
<disp-quote><p>&#x201C;Deepfake videos made with artificial intelligence can be a powerful force because they make it appear that someone did or said something that they never did, altering how the viewers see politicians, corporate executives, celebrities and other public figures. The tools necessary to make these videos are available online, with some people making celebrity mashups and one app offering to insert users&#x2019; faces into famous movie scenes.&#x201D;</p>
</disp-quote><p>To be sure, deepfakes do have benefits in education, arts, and individual autonomy [<xref ref-type="bibr" rid="ref-399">399</xref>]. In education, deepfakes could be used to provide information to students in a more interesting manner. For example, deepfakes make it possible to &#x201C;manufacture videos of historical figures speaking directly to students, giving an otherwise unappealing lecture a new lease on life&#x201D;. In the arts, deepfake technology has allowed long-dead actors to be resurrected for fresh roles in new movies. An example is a recent <italic>Star Wars</italic> movie with the deceased actress Carrie Fisher. In helping to maintain some personal autonomy, deepfake audio technology could help restore the ability to speak for a person suffering from some form of paralysis that prevents normal speaking.</p>
<p>On the other hand, the authors of [<xref ref-type="bibr" rid="ref-399">399</xref>] cited a long list of harmful uses of deepfakes, from harm to individuals or organizations (e.g., exploitation, sabotage), to harm to society (e.g., distortion of democratic discourse, manipulation of elections, eroding trust in institutions, exacerbating social division, undermining public safety, undermining diplomacy, jeopardizing national security, undermining journalism, crying deepfake news as liar&#x2019;s dividend).<xref ref-type="fn" rid="fn333"><sup>333</sup></xref><fn id="fn333"><label>333</label><p>Watch also Danielle Citron&#x2019;s 2019 TED talk &#x201C;How deepfakes undermine truth and threaten democracy&#x201D; [<xref ref-type="bibr" rid="ref-401">401</xref>].</p></fn> See also [<xref ref-type="bibr" rid="ref-402">402</xref>] [<xref ref-type="bibr" rid="ref-403">403</xref>] [<xref ref-type="bibr" rid="ref-404">404</xref>] [<xref ref-type="bibr" rid="ref-405">405</xref>].</p>
<p>Researchers have been in a race to develop methods to detect deepfakes, a difficult technological challenge [<xref ref-type="bibr" rid="ref-406">406</xref>]. One method is to spot the subtle characteristics of how someone spoke to provide a basis to determine whether a video was true or fake [<xref ref-type="bibr" rid="ref-400">400</xref>]. But that method was not a top-five winner of the <italic>DeepFake Detection Challenge</italic> (DFDC) [<xref ref-type="bibr" rid="ref-407">407</xref>] organized in the period 2019-2020 by &#x201C;The Partnership for AI, in collaboration with large companies including Facebook, Microsoft, and Amazon,&#x201D; with a total prize money of one million dollars, divided among the top five winners, out of more than two thousand teams [<xref ref-type="bibr" rid="ref-408">408</xref>].</p>
<p>Humans&#x2019; ability to detect deepfakes compared well with the &#x201C;leading model,&#x201D; i.e., the DFDC top winner [<xref ref-type="bibr" rid="ref-408">408</xref>]. The results were &#x201C;at odds with the commonly held view in media forensics that ordinary people have extremely limited ability to detect media manipulations&#x201D; [<xref ref-type="bibr" rid="ref-408">408</xref>]; see Figure <xref ref-type="fig" rid="fig-150">150</xref>, where the width of a violin plot,<xref ref-type="fn" rid="fn334"><sup>334</sup></xref><fn id="fn334"><label>334</label><p>See the classic original paper [<xref ref-type="bibr" rid="ref-409">409</xref>], which was cited 1,554 times on Google Scholar as of 2022.08.24. See also [<xref ref-type="bibr" rid="ref-410">410</xref>] with Python code and resulting images on <ext-link ext-link-type="uri" xlink:href="https://github.com/erykml/medium_articles/blob/master/Statistics/violin_plots.ipynb">GitHub</ext-link>.</p></fn> at a given accuracy, represents the number of participants. In Col. 2 of Figure <xref ref-type="fig" rid="fig-150">150</xref>, the area of the blue violin <italic>above</italic> the leading-model accuracy of 65% corresponds to 82% of the participants, with the area of the whole violin corresponding to all participants. A crowd does have a collective accuracy comparable to (or for those who viewed at least 10 videos, better than) the leading model; see Cols. 5, 6, 7 in Figure <xref ref-type="fig" rid="fig-150">150</xref>.</p>
<p>While it is difficult to detect AI deepfakes, the MIT Media Lab DeepFake detection project advised paying attention to the following eight facial features [<xref ref-type="bibr" rid="ref-411">411</xref>]:</p>
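<p>The difference between individual accuracy and crowd-mean accuracy can be illustrated with a small sketch. The numbers below are made up for illustration and are not data from [<xref ref-type="bibr" rid="ref-408">408</xref>]; the sketch further assumes that each participant reports a confidence between 0 and 1 that a video is a deepfake and that a confidence above 0.5 counts as a &#x201C;fake&#x201D; verdict, which may differ from the protocol of the cited study. Only the 65% leading-model accuracy is taken from the text above.</p>
<code language="python">
```python
from statistics import mean


def accuracy(confidences, truth, threshold=0.5):
    """Fraction of videos whose thresholded confidence matches the label."""
    verdicts = [1 if c > threshold else 0 for c in confidences]
    return mean(1.0 if v == t else 0.0 for v, t in zip(verdicts, truth))


def crowd_confidences(responses):
    """Average the participants' confidences video by video."""
    return [mean(per_video) for per_video in zip(*responses)]


# Hypothetical data: 1 = deepfake, 0 = real; one row per participant.
truth = [1, 0, 1, 0]
responses = [
    [0.9, 0.2, 0.4, 0.1],  # participant 1 misses video 3
    [0.6, 0.7, 0.8, 0.3],  # participant 2 misses video 2
    [0.3, 0.1, 0.7, 0.4],  # participant 3 misses video 1
]

individual = [accuracy(r, truth) for r in responses]
crowd = accuracy(crowd_confidences(responses), truth)

# Share of participants above the leading-model accuracy of 65%,
# i.e., the part of the violin lying above the model line.
model_accuracy = 0.65
share_above = mean(1.0 if a > model_accuracy else 0.0 for a in individual)

print(individual)   # [0.75, 0.75, 0.75]
print(crowd)        # 1.0
print(share_above)  # 1.0
```
</code>
<p>In this toy example every participant errs on a different video, so averaging the confidences cancels the individual mistakes and the crowd verdict is correct on all four videos.</p>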
<list list-type="simple">
<list-item><label>(1)</label><p>&#x201C;Face. High-end DeepFake manipulations are almost always facial transformations.</p></list-item>
<list-item><label>(2)</label><p>&#x201C;Cheeks and forehead. Does the skin appear too smooth or too wrinkly? Is the agedness of the skin similar to the agedness of the hair and eyes? DeepFakes are often incongruent on some dimensions.</p></list-item>
<list-item><label>(3)</label><p>&#x201C;Eyes and eyebrows. Do shadows appear in places that you would expect? DeepFakes often fail to fully represent the natural physics of a scene.</p></list-item>
<list-item><label>(4)</label><p>&#x201C;Glasses. Is there any glare? Is there too much glare? Does the angle of the glare change when the person moves? Once again, DeepFakes often fail to fully represent the natural physics of lighting.</p></list-item>
<list-item><label>(5)</label><p>&#x201C;Facial hair or lack thereof. Does this facial hair look real? DeepFakes might add or remove a mustache, sideburns, or beard. But, DeepFakes often fail to make facial hair transformations fully natural.</p></list-item>
<list-item><label>(6)</label><p>&#x201C;Facial moles. Does the mole look real?</p></list-item>
<list-item><label>(7)</label><p>Eye &#x201C;blinking. Does the person blink enough or too much?</p></list-item>
<list-item><label>(8)</label><p>&#x201C;Size and color of the lips. Does the size and color match the rest of the person&#x2019;s face?&#x201D;</p></list-item></list>

<fig id="fig-150">
<label>Figure 150</label>
<caption><title><italic>DeepFake detection</italic> (Section <xref ref-type="sec" rid="s14_4_1">14.4.1</xref>). Violin plots. <inline-formula id="ieqn-2047"><mml:math id="mml-ieqn-2047"><mml:mi>&#x2022;</mml:mi></mml:math></inline-formula> <italic>Individual vs machine</italic>. The leading model had an accuracy of 65% on 4,000 videos (Col. 1). In Experiment 1 (E1), 5,524 participants were asked to identify a deepfake from each of 56 pairs of videos. The participants had a mean accuracy of 80% (white dot in Col. 2), with 82% of the participants having an accuracy better than that of the leading model (65%). In Experiment 2 (E2), using a subset of randomly sampled videos, the recruited (R) participants had mean accuracy at 66% (Col. 3), the non-recruited (NR) participants at 69% (Col. 4), and leading model at 80%. <inline-formula id="ieqn-2048"><mml:math id="mml-ieqn-2048"><mml:mi>&#x2022;</mml:mi></mml:math></inline-formula> <italic>Crowd wisdom vs machine</italic>. Crowd mean is the average accuracy by participants for each video. R participants had a crowd-mean average accuracy at 74%, NR participants at 80%, which was the same for the leading model, and NR participants who viewed at least 10 videos at 86% [<xref ref-type="bibr" rid="ref-408">408</xref>]. <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by-nc-nd/4.0/.tif">(CC BY-NC-ND 4.0)</ext-link></title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-150.tif"/>
</fig>
</sec>
<sec id="s14_4_2"><label>14.4.2</label>
<title>Facial recognition nightmare</title>
<p>&#x201C;We&#x2019;re all screwed&#x201D; as Clearview AI, a startup company, uses deep learning to identify faces against a large database involving more than three billion photos collected from &#x201C;Facebook, Youtube, Venmo and millions of other websites&#x201D; [<xref ref-type="bibr" rid="ref-412">412</xref>]. Their software &#x201C;could end your ability to walk down the street anonymously, and provided it to hundreds of law enforcement agencies&#x201D;. More than 600 law enforcement agencies have started to use Clearview AI software to &#x201C;help solve shoplifting, identity theft, credit card fraud, murder and child sexual exploitation cases&#x201D;. On the other hand, the tool could be abused, such as identifying &#x201C;activists at a protest or an attractive stranger on the subway, revealing not just their names but where they lived, what they did and whom they knew&#x201D;. Some large cities, such as San Francisco, have banned the use of facial recognition by the police.</p>
<p>A breach of the Clearview AI database occurred just a few weeks after the article by [<xref ref-type="bibr" rid="ref-412">412</xref>], an unforeseen, but not surprising, event [<xref ref-type="bibr" rid="ref-413">413</xref>]:</p>
<disp-quote><p>&#x201C;Clearview AI, the controversial and secretive facial recognition company, recently experienced its first major data breach&#x2014;a scary prospect considering the sheer amount and scope of personal information in its database, as well as the fact that access to it is supposed to be restricted to law enforcement agencies.&#x201D;</p>
</disp-quote><p>The leaked documents showed that Clearview AI had a wide range of customers, from law-enforcement agencies (both domestic and international) to large retail stores (Macy&#x2019;s, Best Buy, Walmart). Experts describe Clearview AI&#x2019;s plan to produce a publicly available face recognition app as &#x201C;dangerous&#x201D;. So we got screwed again.</p>
<p>There was a documented wrongful arrest caused by a face-recognition algorithm that exhibited racial bias, i.e., a bias against people of color [<xref ref-type="bibr" rid="ref-414">414</xref>]. A detective showed the wrongful-arrest victim a photo that was clearly not the victim, and asked &#x201C;Is this you?&#x201D; to which the victim replied &#x201C;You think all black men look alike?&#x201D;</p>
<p>It is well known that AI has a &#x201C;propensity to replicate, reinforce or amplify harmful existing social biases&#x201D; [<xref ref-type="bibr" rid="ref-415">415</xref>], such as racial bias [<xref ref-type="bibr" rid="ref-416">416</xref>] among others: &#x201C;An early example arose in 2015, when a software engineer pointed out that Google&#x2019;s image-recognition system had labeled his Black friends as &#x2018;gorillas.&#x2019; Another example arose when Joy Buolamwini, an algorithmic fairness researcher at MIT, tried facial recognition on herself&#x2014;and found that it wouldn&#x2019;t recognize her, a Black woman, until she put a white mask over her face. These examples highlighted facial recognition&#x2019;s failure to achieve another type of fairness: representational fairness&#x201D; [<xref ref-type="bibr" rid="ref-417">417</xref>].<xref ref-type="fn" rid="fn335"><sup>335</sup></xref><fn id="fn335"><label>335</label><p>See also [<xref ref-type="bibr" rid="ref-418">418</xref>] on a number of relevant AI ethical issues such as: &#x201C;Who bears responsibility in the event of harm resulting from the use of an AI system; How can AI systems be prevented from reflecting existing discrimination, biases and social injustices based on their training data, thereby exacerbating them; How can the privacy of people be protected, given that personal data can be collected and analysed so easily by many.&#x201D; Perhaps the toughest question is &#x201C;Who should get to decide which moral intuitions, which values, should be embedded in algorithms?&#x201D; [<xref ref-type="bibr" rid="ref-417">417</xref>].</p></fn></p>
<p>A legal measure has been taken against gathering data for facial-recognition software. In May 2022, Clearview AI was fined &#x201C;&#x0024;10 million for scraping UK faces from the web. That might not be the end of it&#x201D;; in addition, &#x201C;the firm was also ordered to delete all of the data it holds on UK citizens&#x201D; [<xref ref-type="bibr" rid="ref-419">419</xref>].</p>
<p>There were more of such measures: &#x201C;Earlier this year, Italian data protection authorities fined Clearview AI &#x20AC;20 million (&#x0024;21 million) for breaching data protection rules. Authorities in Australia, Canada, France, and Germany have reached similar conclusions.</p>
<p>Even in the US, which does not have a federal data protection law, Clearview AI is facing increasing scrutiny. Earlier this month the ACLU won a major settlement that restricts Clearview from selling its database across the US to most businesses. In the state of Illinois, which has a law on biometric data, Clearview AI cannot sell access to its database to anyone, even the police, for five years&#x201D; [<xref ref-type="bibr" rid="ref-419">419</xref>].</p>
</sec> </sec>
<sec id="s14_5"><label>14.5</label>
<title>AI cannot tackle controversial human problems</title>
<p>Given the barrier of meaning described in Section <xref ref-type="sec" rid="s14_3">14.3</xref>, it is clear that there are many problems that AI could not be trained to solve, since even humans do not agree on how to classify certain activities as offending or acceptable. As reported in [<xref ref-type="bibr" rid="ref-420">420</xref>]:</p>
<disp-quote><p>&#x201C;Mr. Schroepfer&#x2014;or Schrep, as he is known internally&#x2014;is the person at Facebook leading the efforts to build the automated tools to sort through and erase the millions of [hate-speech] posts. But the task is Sisyphean, he acknowledged over the course of three interviews recently.</p>
<p>That&#x2019;s because every time Mr. Schroepfer [Facebook&#x2019;s Chief Technology Officer] and his more than 150 engineering specialists create A.I. solutions that flag and squelch noxious material, new and dubious posts that the A.I. systems have never seen before pop up&#x2014;and are thus not caught. The task is made more difficult because &#x201C;bad activity&#x201D; is often in the eye of the beholder and humans, let alone machines, cannot agree on what that is.</p>
<p>&#x201C;I don&#x2019;t think I&#x2019;m speaking out of turn to say that I&#x2019;ve seen Schrep cry at work,&#x201D; said Jocelyn Goldfein, a venture capitalist at Zetta Venture Partners who worked with him at Facebook.&#x201D;</p>
</disp-quote> </sec>
<sec id="s14_6"><label>14.6</label>
<title>So what&#x2019;s new? Learning to think like babies</title>
<p>Because of AI&#x2019;s inability to understand (barrier of meaning) and to solve controversial human issues, one idea to tackle such problems is to start with baby steps in trying to teach AI to think like babies, as recounted by [<xref ref-type="bibr" rid="ref-321">321</xref>]:</p>
<disp-quote><p>&#x201C;The problem is that these new algorithms are beginning to bump up against significant limitations. They need enormous amounts of data, only some kinds of data will do, and they&#x2019;re not very good at generalizing from that data. Babies seem to learn much more general and powerful kinds of knowledge than AIs do, from much less and much messier data. In fact, human babies are the best learners in the universe. How do they do it? And could we get an AI to do the same?</p>
<p>First, there&#x2019;s the issue of data. AIs need enormous amounts of it; they have to be trained on hundreds of millions of images or games.</p>
<p>Children, on the other hand, can learn new categories from just a small number of examples. A few storybook pictures can teach them not only about cats and dogs but jaguars and rhinos and unicorns.</p>
<p>AIs also need what computer scientists call &#x201C;supervision.&#x201D; In order to learn, they must be given a label for each image they &#x201C;see&#x201D; or a score for each move in a game. Baby data, by contrast, is largely unsupervised.</p>
<p>Even with a lot of supervised data, AIs can&#x2019;t make the same kinds of generalizations that human children can. Their knowledge is much narrower and more limited, and they are easily fooled by what are called &#x201C;adversarial examples.&#x201D; For instance, an AI image recognition system will confidently say that a mixed-up jumble of pixels is a dog if the jumble happens to fit the right statistical pattern&#x2014;a mistake a baby would never make.&#x201D;</p>
</disp-quote><p>Regarding early stopping and generalization error in network training, see Remark <xref ref-type="statement" rid="st6_1">6.1</xref> in Section <xref ref-type="sec" rid="s6_1">6.1</xref>. To make AIs into more robust and resilient learners, researchers are developing methods to build curiosity into AIs, instead of focusing on immediate rewards.</p>
<fig id="fig-151">
<label>Figure 151</label>
<caption><title><italic>Lack of transparency and irreproducibility</italic> (Section <xref ref-type="sec" rid="s14_7">14.7</xref>). The table shows many missing pieces of information for the three networks&#x2014;Lesion, Breast, and Case models&#x2014;used to detect breast cancer. Learning rate, Section <xref ref-type="sec" rid="s6_2">6.2</xref>. Learning-rate schedule, Section <xref ref-type="sec" rid="s6_3_1">6.3.1</xref>, Figure <xref ref-type="fig" rid="fig-65">65</xref> in Section <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>. SGD with momentum, Section <xref ref-type="sec" rid="s6_3_2">6.3.2</xref> and Remark <xref ref-type="statement" rid="st6_5">6.5</xref>. Adam algorithm, Section <xref ref-type="sec" rid="s6_5_6">6.5.6</xref>. Batch size, Sections <xref ref-type="sec" rid="s6_3_1">6.3.1</xref> and <xref ref-type="sec" rid="s6_3_5">6.3.5</xref>. Epoch, Footnote <xref ref-type="fn" rid="fn145">145</xref>. [<xref ref-type="bibr" rid="ref-421">421</xref>]. (Figure reproduced with permission of the authors).</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-151.tif"/>
</fig>
</sec>
<sec id="s14_7"><label>14.7</label>
<title>Lack of transparency and irreproducibility of results</title>
<p>For &#x201C;multiple years now&#x201D;, there have been articles on deep learning that looked more like a promotion/advertisement for newly developed AI technologies, rather than scientific papers in the traditional sense that published results should be replicable and verifiable [<xref ref-type="bibr" rid="ref-422">422</xref>]. But it was only on 2020 Oct 14 that many scientists [<xref ref-type="bibr" rid="ref-421">421</xref>] had had enough and protested the lack of transparency in AI research in a &#x201C;damning&#x201D; article in <italic>Nature</italic>, a major scientific journal.</p>
<disp-quote><p>&#x201C;We couldn&#x2019;t take it anymore,&#x201D; says Benjamin Haibe-Kains, the lead author of the response, who studies computational genomics at the University of Toronto. &#x201C;It&#x2019;s not about this study in particular&#x2014;it&#x2019;s a trend we&#x2019;ve been witnessing for multiple years now that has started to really bother us.&#x201D; [<xref ref-type="bibr" rid="ref-422">422</xref>]</p>
</disp-quote><p>The particular contentious study was published by the Google-Health authors of [<xref ref-type="bibr" rid="ref-423">423</xref>] on the use of AI in medical imaging to detect breast cancer. But these authors of [<xref ref-type="bibr" rid="ref-423">423</xref>] provided so little information about their code and how it was tested that their article read more like a &#x201C;promotion of proprietary tech&#x201D; than a scientific paper. Figure <xref ref-type="fig" rid="fig-151">151</xref> shows the crucial pieces of information that are missing and would be needed to reproduce the results. A question would immediately come to mind: Why would a reputable journal like <italic>Nature</italic> accept such a paper? Was the review rigorous enough?</p>

<disp-quote><p>&#x201C;When we saw that paper from Google, we realized that it was yet another example of a very high-profile journal publishing a very exciting study that has nothing to do with science,&#x201D; Haibe-Kains says. &#x201C;It&#x2019;s more an advertisement for cool technology. We can&#x2019;t really do anything with it.&#x201D; [<xref ref-type="bibr" rid="ref-422">422</xref>]</p>
</disp-quote><p>According to [<xref ref-type="bibr" rid="ref-421">421</xref>], even though McKinney et al. [<xref ref-type="bibr" rid="ref-423">423</xref>] stated that &#x201C;all experiments and implementation details were described in sufficient detail in the supplementary methods section of their Article to &#x2018;support replication with non-proprietary libraries&#x2019;,&#x201D; that was a subjective statement, and replicating their results would be a difficult task, since such a textual description can hide a high level of code complexity, and nuances in the computer code can have large effects on the training and evaluation results.</p>
<disp-quote><p>&#x201C;AI is feeling the heat for several reasons. For a start, it is a newcomer. It has only really become an experimental science in the past decade, says Joelle Pineau, a computer scientist at Facebook AI Research and McGill University, who coauthored the complaint. &#x2018;It used to be theoretical, but more and more we are running experiments,&#x2019; she says. &#x2018;And our dedication to sound methodology is lagging behind the ambition of our experiments.&#x2019; &#x201D; [<xref ref-type="bibr" rid="ref-422">422</xref>]</p>
</disp-quote><p>No progress in science could be made if results were not verifiable and replicable by independent researchers.</p>
</sec>
<sec id="s14_8"><label>14.8</label>
<title>Killing you!</title>
<p>Oh, one more thing: &#x201C;A.I. Is Making it Easier to Kill (You). Here&#x2019;s How,&#x201D; New York Times Documentaries, 2019.12.13 (<ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/video/technology/100000006082083/lethal-autonomous-weapons.html">Original website</ext-link>) (<ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=GFD_Cgr2zho">YouTube</ext-link>).</p>
<p>And getting better at it every day, e.g., by using &#x201C;a suite of artificial intelligence-driven systems that will be able to control networked &#x2018;loyal wingman&#x2019; type drones and fully autonomous unmanned combat air vehicles&#x201D; [<xref ref-type="bibr" rid="ref-424">424</xref>]; see also &#x201C;Collaborative Operations in Denied Environment (CODE) Phase 2 Concept Video&#x201D; (<ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=2cWa7hCAwkk">YouTube</ext-link>).</p>
</sec>
</sec>
</body>
<back>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>1.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Rosenblatt</surname>, <given-names>F.</given-names></string-name></person-group> (<year>1957</year>). <article-title>The perceptron: A perceiving and recognizing automaton</article-title>. <comment>Technical report, Cornell University. Cornell University, Report No. 85-460-1. Project PARA, January</comment>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20190331135126/https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf">Internet archive</ext-link>. <comment>1070, 1079, 1123, 1279, 1339</comment></mixed-citation></ref>
<ref id="ref-2"><label>2.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Rosenblatt</surname>, <given-names>F.</given-names></string-name></person-group> (<year>1962</year>). <source>Principles of neurodynamics: Perceptrons and the theory of brain mechanisms</source>. <publisher-name>Spartan Books</publisher-name>. <comment>1070, 1079, 1114, 1123, 1278, 1280, 1281, 1282, 1283, 1339</comment></mixed-citation></ref>
<ref id="ref-3"><label>3.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Polyak</surname>, <given-names>B.</given-names></string-name></person-group> (<year>1964</year>). <article-title>Some methods of speeding up the convergence of iteration methods</article-title>. <source>USSR Computational Mathematics and Mathematical Physics</source>, <volume>4</volume>(<issue>5</issue>), <fpage>1</fpage>&#x2013;<lpage>17</lpage>. DOI <pub-id pub-id-type="doi">10.1016/0041-5553(64)90137-5</pub-id>. <comment>1070, 1078, 1079, 1153, 1157, 1158, 1159</comment></mixed-citation></ref>
<ref id="ref-4"><label>4.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Roose</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2022</year>). <article-title>An A.I.-Generated Picture Won an Art Prize. Artists Aren&#x2019;t Happy</article-title>. <source>New York Times</source>, (<italic>Sep 2</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html">Original website</ext-link>. <comment>1074, 1075</comment></mixed-citation></ref>
<ref id="ref-5"><label>5.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jumper</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Evans</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Pritzel</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Green</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Figurnov</surname>, <given-names>M.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Highly accurate protein structure prediction with AlphaFold</article-title>. <source>Nature</source>, <volume>596</volume>(<issue>7873</issue>), <fpage>583</fpage>&#x2013;<lpage>589</lpage>. <comment>1076</comment>; <pub-id pub-id-type="pmid">34265844</pub-id></mixed-citation></ref>
<ref id="ref-6"><label>6.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Silver</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Maddison</surname>, <given-names>C. J.</given-names></string-name>, <string-name><surname>Guez</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Sifre</surname>, <given-names>L.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2016</year>). <article-title>Mastering the game of Go with deep neural networks and tree search</article-title>. <source>Nature</source>, <volume>529</volume>(<issue>7587</issue>), <fpage>484</fpage>+. <ext-link ext-link-type="uri" xlink:href="https://www.nature.com/articles/nature16961">Original website</ext-link>. <comment>1076, 1080, 1081</comment>; <pub-id pub-id-type="pmid">26819042</pub-id></mixed-citation></ref>
<ref id="ref-7"><label>7.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Moyer</surname>, <given-names>C.</given-names></string-name></person-group> <article-title>How Google&#x2019;s AlphaGo Beat a Go World Champion</article-title>. <year>2016</year> <month>Mar</month> <day>28</day>, <ext-link ext-link-type="uri" xlink:href="https://www.theatlantic.com/technology/archive/2016/03/the-invisible-opponent/475611/">Original website</ext-link>. <comment>1076</comment></mixed-citation></ref>
<ref id="ref-8"><label>8.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Edwards</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2022</year>). <article-title>DeepMind breaks 50-year math record using AI; new record falls a week later</article-title>. <source>Ars Technica</source>, (<italic>Oct 13</italic>). <ext-link ext-link-type="uri" xlink:href="https://arstechnica.com/information-technology/2022/10/deepmind-breaks-50-year-math-record-using-ai-new-record-falls-a-week-later/">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20221015153548/https://arstechnica.com/information-technology/2022/10/deepmind-breaks-50-year-math-record-using-ai-new-record-falls-a-week-later/">Internet archive</ext-link>. <comment>1074</comment></mixed-citation></ref>
<ref id="ref-9"><label>9.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Humer</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2212.08989">arXiv:2212.08989</ext-link>. <comment>1074</comment></mixed-citation></ref>
<ref id="ref-10"><label>10.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Roose</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2023</year>). <article-title>Bing (Yes, Bing) Just Made Search Interesting Again</article-title>. <source>New York Times</source>, (<italic>Feb 8</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2023/02/08/technology/microsoft-bing-openai-artificial-intelligence.html">Original website</ext-link>. <comment>1074</comment></mixed-citation></ref>
<ref id="ref-11"><label>11.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Knight</surname>, <given-names>W.</given-names></string-name></person-group> (<year>2023</year>). <article-title>Meet Bard, Google&#x2019;s Answer to ChatGPT</article-title>. <source>WIRED</source>, (<italic>Feb 6</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.wired.com/story/meet-bard-googles-answer-to-chatgpt/">Original website</ext-link>. <comment>1074</comment></mixed-citation></ref>
<ref id="ref-12"><label>12.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Schmidhuber</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Deep learning in neural networks: An overview</article-title>. <source>Neural Networks</source>, <volume>61</volume>, <fpage>87</fpage>&#x2013;<lpage>117</lpage>. <comment>1075, 1104, 1106, 1120, 1291, 1292, 1293, 1340</comment></mixed-citation></ref>
<ref id="ref-13"><label>13.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>LeCun</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Deep learning</article-title>. <source>Nature</source>, <volume>521</volume>(<issue>7553</issue>), <fpage>436</fpage>&#x2013;<lpage>444</lpage>. <comment>1075, 1080, 1082, 1106, 1120, 1121, 1122, 1197, 1199</comment>; <pub-id pub-id-type="pmid">26017442</pub-id></mixed-citation></ref>
<ref id="ref-14"><label>14.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Khan</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Yairi</surname>, <given-names>T.</given-names></string-name></person-group> (<year>2018</year>). <article-title>A review on the application of deep learning in system health management</article-title>. <source>Mechanical Systems and Signal Processing</source>, <volume>107</volume>, <fpage>241</fpage>&#x2013;<lpage>265</lpage>. <comment>1075</comment></mixed-citation></ref>
<ref id="ref-15"><label>15.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sanchez-Lengeling</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Aspuru-Guzik</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Inverse molecular design using machine learning: Generative models for matter engineering</article-title>. <source>Science</source>, <volume>361</volume>(<italic>6400, SI</italic>), <fpage>360</fpage>&#x2013;<lpage>365</lpage>. <comment>1075</comment>; <pub-id pub-id-type="pmid">30049875</pub-id></mixed-citation></ref>
<ref id="ref-16"><label>16.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ching</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Himmelstein</surname>, <given-names>D. S.</given-names></string-name>, <string-name><surname>Beaulieu-Jones</surname>, <given-names>B. K.</given-names></string-name>, <string-name><surname>Kalinin</surname>, <given-names>A. A.</given-names></string-name>, <string-name><surname>Do</surname>, <given-names>B. T.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2018</year>). <article-title>Opportunities and obstacles for deep learning in biology and medicine</article-title>. <source>Journal of the Royal Society Interface</source>, <volume>15</volume>(<issue>141</issue>). <comment>1075</comment></mixed-citation></ref>
<ref id="ref-17"><label>17.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Quinn</surname>, <given-names>J. A.</given-names></string-name>, <string-name><surname>Nyhan</surname>, <given-names>M. M.</given-names></string-name>, <string-name><surname>Navarro</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Coluccia</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Bromley</surname>, <given-names>L.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2018</year>). <article-title>Humanitarian applications of machine learning with remote-sensing data: review and case study in refugee settlement mapping</article-title>. <source>Philosophical Transactions of the Royal Society A-Mathematical Physical and Engineering Sciences</source>, <volume>376</volume>(<issue>2128</issue>). <comment>1075</comment></mixed-citation></ref>
<ref id="ref-18"><label>18.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Higham</surname>, <given-names>C. F.</given-names></string-name>, <string-name><surname>Higham</surname>, <given-names>D. J.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Deep learning: An introduction for applied mathematicians</article-title>. <source>SIAM Review</source>, <volume>61</volume>(<issue>4</issue>), <fpage>860</fpage>&#x2013;<lpage>891</lpage>. <comment>1075</comment></mixed-citation></ref>
<ref id="ref-19"><label>19.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Dayan</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Abbott</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2001</year>). <source>Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems</source>. <publisher-name>MIT Press</publisher-name>. <comment>1075, 1077, 1079, 1098, 1099, 1106, 1107, 1108, 1109, 1111, 1280, 1283, 1284, 1285, 1287</comment></mixed-citation></ref>
<ref id="ref-20"><label>20.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sze</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>Y. H.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>T. J.</given-names></string-name>, <string-name><surname>Emer</surname>, <given-names>J. S.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Efficient Processing of Deep Neural Networks: A Tutorial and Survey</article-title>. <source>Proceedings of the IEEE</source>, <volume>105</volume>(<issue>12</issue>), <fpage>2295</fpage>&#x2013;<lpage>2329</lpage>. <comment>1075, 1085, 1100, 1106, 1277</comment></mixed-citation></ref>
<ref id="ref-21"><label>21.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Nielsen</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2015</year>). <source>Neural Networks and Deep Learning</source>. <publisher-name>Determination Press</publisher-name>. <ext-link ext-link-type="uri" xlink:href="http://neuralnetworksanddeeplearning.com">Original website</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20190113093113/http://neuralnetworksanddeeplearning.com">Internet archive</ext-link>. <comment>1076, 1100, 1106, 1134, 1135, 1277, 1278, 1281</comment></mixed-citation></ref>
<ref id="ref-22"><label>22.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rumelhart</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Williams</surname>, <given-names>R.</given-names></string-name></person-group> (<year>1986</year>). <article-title>Learning representations by back-propagating errors</article-title>. <source>Nature</source>, <volume>323</volume>(<issue>6088</issue>), <fpage>533</fpage>&#x2013;<lpage>536</lpage>. <comment>1076, 1158, 1283, 1291, 1292, 1293, 1339</comment></mixed-citation></ref>
<ref id="ref-23"><label>23.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ghaboussi</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Garrett</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>X.</given-names></string-name></person-group> (<year>1991</year>). <article-title>Knowledge-based modeling of material behavior with neural networks</article-title>. <source>Journal of Engineering Mechanics-ASCE</source>, <volume>117</volume>(<issue>1</issue>), <fpage>132</fpage>&#x2013;<lpage>153</lpage>. <comment>1076, 1077, 1094, 1100, 1241, 1277, 1340</comment></mixed-citation></ref>
<ref id="ref-24"><label>24.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hochreiter</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Schmidhuber</surname>, <given-names>J.</given-names></string-name></person-group> (<year>1997</year>). <article-title>Long short-term memory</article-title>. <source>Neural Computation</source>, <volume>9</volume>(<issue>8</issue>), <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>. <comment>1077, 1097, 1199</comment>; <pub-id pub-id-type="pmid">9377276</pub-id></mixed-citation></ref>
<ref id="ref-25"><label>25.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>W. C.</given-names></string-name></person-group> (<year>2018</year>). <article-title>A multiscale multi-permeability poroplasticity model linked by recursive homogenizations and deep learning</article-title>. <source>Computer Methods in Applied Mechanics and Engineering</source>, <volume>334</volume>, <fpage>337</fpage>&#x2013;<lpage>380</lpage>. <comment>1077, 1079, 1090, 1092, 1093, 1094, 1095, 1096, 1240, 1241, 1242, 1243, 1244, 1245, 1246, 1247, 1248, 1249, 1250, 1251, 1252</comment></mixed-citation></ref>
<ref id="ref-26"><label>26.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Mohan</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gaitonde</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2018</year>). <article-title>A deep learning based approach to reduced order modeling for turbulent flow control using LSTM neural networks</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1804.09269">arXiv:1804.09269</ext-link> <italic>[physics.comp-ph]</italic>. <month>Apr</month> <day>24</day>. <comment>1077, 1078, 1079, 1096, 1097, 1098, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259, 1260</comment></mixed-citation></ref>
<ref id="ref-27"><label>27.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zaman</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>J.</given-names></string-name></person-group> (<year>1998</year>). <article-title>A neural network model for a cohesionless soil</article-title>. <comment>In Attoh-Okine, N. O. (ed.)</comment>. <source>Artificial Intelligence and Mathematical Methods in Pavement and Geomechanical Systems</source>. <conf-name>International Workshop on Artificial Intelligence and Mathematical Methods in Pavement and Geomechanical Systems</conf-name>, <conf-loc>Miami, FL</conf-loc>, <conf-date>Nov 05-06, 1998</conf-date>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-28"><label>28.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Fan</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Schlup</surname>, <given-names>J.</given-names></string-name></person-group> (<year>1998</year>). <article-title>Monitoring the process of curing of epoxy/graphite fiber composites with a recurrent neural network as a soft sensor</article-title>. <source>Engineering Applications of Artificial Intelligence</source>, <volume>11</volume>(<issue>2</issue>), <fpage>293</fpage>&#x2013;<lpage>306</lpage>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-29"><label>29.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>T.</given-names></string-name></person-group> (<year>1999</year>). <article-title>Automatic structure and parameter training methods for modeling of mechanical systems by recurrent neural networks</article-title>. <source>Applied Mathematical Modelling</source>, <volume>23</volume>(<issue>12</issue>), <fpage>933</fpage>&#x2013;<lpage>944</lpage>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-30"><label>30.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Waszczyszyn</surname>, <given-names>Z.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Neural networks in structural engineering: Some recent results and prospects for applications</article-title>. <comment>In Topping, B. H. V. (ed.)</comment>. <source>Computational Mechanics for the Twenty-First Century</source>. <conf-name>5th International Conference on Computational Structures Technology/2nd International Conference on Engineering Computational Technology</conf-name>, <conf-loc>Leuven, Belgium</conf-loc>, <conf-date>Sep 06-08, 2000</conf-date>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-31"><label>31.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Vaswani</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Shazeer</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Parmar</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Uszkoreit</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Jones</surname>, <given-names>L.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2017</year>). <article-title>Attention Is All You Need</article-title>. <comment>CoRR, abs/1706.03762v5</comment>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1706.03762v5">arXiv:1706.03762v5</ext-link>. <comment>See Footnote <xref ref-type="fn" rid="fn337">337</xref>. 1077, 1079, 1203, 1206, 1207, 1208, 1209, 1210, 1211, 1316</comment></mixed-citation></ref>
<ref id="ref-32"><label>32.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hahnloser</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Sarpeshkar</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Mahowald</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Douglas</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Seung</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit (vol 405, pg 947, 2000)</article-title>. <source>Nature</source>, <volume>408</volume>(<issue>6815</issue>), <fpage>1012</fpage>&#x2013;<lpage>U24</lpage>. <comment>1077, 1107, 1287, 1289, 1290</comment></mixed-citation></ref>
<ref id="ref-33"><label>33.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Jarrett</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Kavukcuoglu</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Ranzato</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>LeCun</surname>, <given-names>Y.</given-names></string-name></person-group> <article-title>What is the Best Multi-Stage Architecture for Object Recognition?</article-title> <source>2009 IEEE 12th International Conference on Computer Vision (ICCV)</source>. <comment>1077, 1107</comment></mixed-citation></ref>
<ref id="ref-34"><label>34.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Nair</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2010</year>). <article-title>Rectified linear units improve restricted boltzmann machines</article-title>. <conf-name>Proceedings of the 27th International Conference on Machine Learning</conf-name>, <conf-loc>Haifa, Israel</conf-loc>. <comment>1077, 1107</comment></mixed-citation></ref>
<ref id="ref-35"><label>35.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Little</surname>, <given-names>W.</given-names></string-name></person-group> (<year>1974</year>). <article-title>The existence of persistent states in the brain</article-title>. <source>Mathematical Biosciences</source>, <volume>19</volume>, <fpage>101</fpage>&#x2013;<lpage>120</lpage>. In <person-group person-group-type="editor"><string-name><surname>Cabrera</surname>, <given-names>B</given-names></string-name> and <string-name><surname>Gutfreund</surname>, <given-names>H</given-names></string-name> and <string-name><surname>Kresin</surname>, <given-names>V</given-names></string-name></person-group> (eds.), <conf-name>From High-Temperature Superconductivity to Microminiature Refrigeration, William Little Symposium on From High-Temperature Superconductivity to Microminiature Refrigeration</conf-name>, <conf-loc>Stanford Univ, Stanford, CA</conf-loc>, <conf-date>Sep 30, 1995</conf-date>.<comment><xref ref-type="fn" rid="fn336">336</xref><fn id="fn336"><label>336</label><p>The landmark paper &#x201C;Little (1974)&#x201D; was not listed in the Web of Science database as of Nov 2018, using the search keywords [<monospace>au=(little) and py=(1974) and ts=(brain)</monospace>]. On the other hand, [<monospace>au=(little) and ts=(The existence of persistent states in the brain)</monospace>], i.e., the author&#x2019;s last name and the full title of the paper, led to the 1995 collection of Little&#x2019;s papers edited by Cabrera et al., in which &#x2018;Little (1974)&#x2019; was found.</p></fn>. 1077, 1288</comment></mixed-citation></ref>
<ref id="ref-36"><label>36.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ramachandran</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Zoph</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Le</surname>, <given-names>Q.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Searching for Activation Functions</article-title>. <comment>CoRR (Computing Research Repository), abs/1710.05941v2. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1710.05941v2">arXiv:1710.05941v2</ext-link>. See Footnote <xref ref-type="fn" rid="fn337">337</xref>. 1077, 1120, 1287, 1289, 1290, 1291</comment></mixed-citation></ref>
<ref id="ref-37"><label>37.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Wuraola</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Patel</surname>, <given-names>N.</given-names></string-name></person-group> <article-title>SQNL: A New Computationally Efficient Activation Function</article-title>. In <source>2018 International Joint Conference on Neural Networks (IJCNN)</source>. <comment>1077, 1288</comment></mixed-citation></ref>
<ref id="ref-38"><label>38.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Oishi</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Yagawa</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Computational mechanics enhanced by deep learning</article-title>. <source>Computer Methods in Applied Mechanics and Engineering</source>, <volume>327</volume>, <fpage>327</fpage>&#x2013;<lpage>351</lpage>. <comment>1077, 1079, 1086, 1087, 1088, 1089, 1100, 1114, 1121, 1128, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1238, 1239, 1240, 1277</comment></mixed-citation></ref>
<ref id="ref-39"><label>39.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Zienkiewicz</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Taylor</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2013</year>). <source>The Finite Element Method: Its Basis and Fundamentals</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Butterworth-Heinemann</publisher-name>. <edition>7th edition</edition>. <comment>1077, 1103, 1231, 1232</comment></mixed-citation></ref>
<ref id="ref-40"><label>40.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Barlow</surname>, <given-names>J.</given-names></string-name></person-group> (<year>1976</year>). <article-title>Optimal stress locations in finite-element models</article-title>. <source>International Journal for Numerical Methods in Engineering</source>, <volume>10</volume>(<issue>2</issue>), <fpage>243</fpage>&#x2013;<lpage>251</lpage>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-41"><label>41.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Barlow</surname>, <given-names>J.</given-names></string-name></person-group> (<year>1977</year>). <article-title>Optimal stress locations in finite-element models - reply</article-title>. <source>International Journal for Numerical Methods in Engineering</source>, <volume>11</volume>(<issue>3</issue>), <fpage>604</fpage>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-42"><label>42.</label><mixed-citation publication-type="journal"><article-title>Abaqus 6.14</article-title>. <source>Theory Guide</source>. <comment>Simulia Systems, Dassault Syst&#x00E8;mes. Subsection 3.2.4 Solid isoparametric quadrilaterals and hexahedra</comment>. (<ext-link ext-link-type="uri" xlink:href="http://ivt-abaqusdoc.ivt.ntnu.no:2080/texis/search/?query=wetting&#x0026;submit.x=0&#x0026;submit.y=0&#x0026;group=bk&#x0026;CDB=v6.14">Website</ext-link>, <comment>go to Section Reference, Abaqus Theory Guide, Section 3 Elements, Section 3.2 Continuum elements, then Section 3.2.4.</comment>). <comment>1077</comment></mixed-citation></ref>
<ref id="ref-43"><label>43.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ghaboussi</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Garrett</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>X.</given-names></string-name></person-group> (<year>1990</year>). <article-title>Material Modeling with Neural Networks</article-title>. In <person-group person-group-type="editor"><string-name><surname>Pande</surname>, <given-names>G. N.</given-names></string-name>, <string-name><surname>Middleton</surname>, <given-names>J.</given-names></string-name></person-group> (eds.), <source>Numerical Methods in Engineering: Theory and Applications</source>, <volume>Vol 2</volume>. <conf-name>3rd International Conf on Numerical Methods in Engineering: Theory and Applications (NUMETA 90)</conf-name>, <conf-loc>Univ Coll Swansea, Swansea, Wales</conf-loc>, <conf-date>Jan 07-11, 1990</conf-date>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-44"><label>44.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>C.</given-names></string-name></person-group> (<year>1989</year>). <article-title>Applying and validating neural network technology for nondestructive evaluation of materials</article-title>. In <source>1989 IEEE International Conference on Systems, Man, and Cybernetics, Vols 1-3: Conference Proceedings</source>. <conf-name>1989 IEEE International Conf on Systems, Man, and Cybernetics: Decision-Making in Large-Scale Systems</conf-name>, <conf-loc>Cambridge, MA</conf-loc>, <conf-date>Nov 14-17, 1989</conf-date>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-45"><label>45.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sayeh</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Viswanathan</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Dhali</surname>, <given-names>S.</given-names></string-name></person-group> (<year>1990</year>). <article-title>Neural networks for assessment of impact and stress relief on composite-materials</article-title>. In <person-group person-group-type="editor"><string-name><surname>Genisio</surname>, <given-names>M.</given-names></string-name></person-group> (ed.), <source>Sixth Annual Conference on Materials Technology: Composite Technology</source>. <conf-name>6th Annual Conf on Materials Technology: Composite Technology, Southern Illinois Univ Carbondale</conf-name>, <conf-loc>Carbondale, IL</conf-loc>, <conf-date>Apr 10-11, 1990</conf-date>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-46"><label>46.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Leclair</surname>, <given-names>S.</given-names></string-name></person-group> (<year>1991</year>). <article-title>A probability neural network (PNN) estimator for improved reliability of noisy sensor data</article-title>. <source>Journal of Reinforced Plastics and Composites</source>, <volume>10</volume>(<issue>4</issue>), <fpage>379</fpage>&#x2013;<lpage>390</lpage>. <comment>1077</comment></mixed-citation></ref>
<ref id="ref-47"><label>47.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Kim</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Choi</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Widemann</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Zohdi</surname>, <given-names>T.</given-names></string-name></person-group> (<year>2020</year>). <article-title>A fast and accurate physics-informed neural network reduced order model with shallow masked autoencoder</article-title>. (<italic>Sep 28</italic>). <comment>Version 2, 2020.09.28</comment>: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2009.11990v2">arXiv:2009.11990v2</ext-link>, <monospace>2009.11990</monospace>. <comment>1077, 1078, 1079, 1261, 1262, 1263, 1264, 1265, 1266, 1267, 1268, 1269, 1271, 1273, 1274, 1275</comment></mixed-citation></ref>
<ref id="ref-48"><label>48.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Kim</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Choi</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Widemann</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Zohdi</surname>, <given-names>T.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Efficient nonlinear manifold reduced order model</article-title>. (<italic>Nov 13</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2011.07727">arXiv:2011.07727</ext-link>, <monospace>2011.07727</monospace>. <comment>1077, 1078, 1079, 1261</comment></mixed-citation></ref>
<ref id="ref-49"><label>49.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Robbins</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Monro</surname>, <given-names>S.</given-names></string-name></person-group> (<year>1951b</year>). <article-title>Stochastic approximation</article-title>. <source>Annals of Mathematical Statistics</source>, <volume>22</volume>(<issue>2</issue>), <fpage>316</fpage>. <comment>1078, 1155, 1161, 1185</comment></mixed-citation></ref>
<ref id="ref-50"><label>50.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Nesterov</surname>, <given-names>I.</given-names></string-name></person-group> (<year>1983</year>). <article-title>A method of the solution of the convex-programming problem with a speed of convergence O(1/<italic>k</italic><sup>2</sup>)</article-title>. <source>Doklady Akademii Nauk SSSR</source>, <volume>269</volume>(<issue>3</issue>), <fpage>543</fpage>&#x2013;<lpage>547</lpage>. <comment>In Russian</comment>. <comment>1078, 1157, 1159</comment></mixed-citation></ref>
<ref id="ref-51"><label>51.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Nesterov</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2018</year>). <source>Lectures on Convex Optimization</source>. <edition>2nd edition</edition>. <publisher-loc>Switzerland</publisher-loc>: <publisher-name>Springer Nature</publisher-name>. <comment>1078, 1157, 1159</comment></mixed-citation></ref>
<ref id="ref-52"><label>52.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Duchi</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hazan</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Singer</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2011</year>). <article-title>Adaptive Subgradient Methods for Online Learning and Stochastic Optimization</article-title>. <source>Journal of Machine Learning Research</source>, <volume>12</volume>, <fpage>2121</fpage>&#x2013;<lpage>2159</lpage>. <comment>1078, 1173</comment></mixed-citation></ref>
<ref id="ref-53"><label>53.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Tieleman</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2012</year>). <article-title>Lecture 6e, rmsprop: Divide the gradient by a running average of its recent magnitude</article-title>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=defQQqkXEfE&#x0026;list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9&#x0026;index=29">YouTube video</ext-link>, <comment>time 5:54. Lecture notes, p.29: <ext-link ext-link-type="uri" xlink:href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://web.archive.org/web/20191117085823/https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">Internet archive</ext-link>. 1078, 1176</comment></mixed-citation></ref>
<ref id="ref-54"><label>54.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Zeiler</surname>, <given-names>M. D.</given-names></string-name></person-group> (<year>2012</year>). <article-title>ADADELTA: An adaptive learning rate method</article-title>. (<italic>Dec 22</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1212.5701">arXiv:1212.5701</ext-link>. <comment>1078, 1174, 1176, 1177</comment></mixed-citation></ref>
<ref id="ref-55"><label>55.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Wilson</surname>, <given-names>A. C.</given-names></string-name>, <string-name><surname>Roelofs</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Stern</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Srebro</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Recht</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2018</year>). <article-title>The marginal value of adaptive gradient methods in machine learning</article-title>. (<italic>May 22</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1705.08292v2">arXiv:1705.08292v2</ext-link>. <comment>Version 1 appeared in 2017, see also the</comment> <ext-link ext-link-type="uri" xlink:href="https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips30/reviews/2186.html">Reviews for NIPS 2017</ext-link>. <comment>1078, 1141, 1153, 1155, 1158, 1159, 1160, 1176, 1178, 1182, 1183, 1185</comment></mixed-citation></ref>
<ref id="ref-56"><label>56.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Loshchilov</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Hutter</surname>, <given-names>F.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Decoupled weight decay regularization</article-title>. (<italic>Jan 4</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1711.05101v3">arXiv:1711.05101v3</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=Bkg6RiCqY7">OpenReview</ext-link>. <comment>1078, 1153, 1155, 1160, 1161, 1167, 1174, 1177, 1183, 1184, 1185, 1191</comment></mixed-citation></ref>
<ref id="ref-57"><label>57.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Bahdanau</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Cho</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Neural machine translation by jointly learning to align and translate</article-title>. <source>CoRR, abs/1409.0473</source>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1409.0473">arXiv:1409.0473</ext-link>. <comment>1079, 1203, 1204, 1205, 1206</comment></mixed-citation></ref>
<ref id="ref-58"><label>58.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Furshpan</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Potter</surname>, <given-names>D.</given-names></string-name></person-group> (<year>1957</year>). <article-title>Mechanism of nerve-impulse transmission at a crayfish synapse</article-title>. <source>Nature</source>, <volume>180</volume>(<issue>4581</issue>), <fpage>342</fpage>&#x2013;<lpage>343</lpage>. <comment>1079, 1290</comment>; <pub-id pub-id-type="pmid">13464833</pub-id></mixed-citation></ref>
<ref id="ref-59"><label>59.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Furshpan</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Potter</surname>, <given-names>D.</given-names></string-name></person-group> (<year>1959b</year>). <article-title>Slow post-synaptic potentials recorded from the giant motor fibre of the crayfish</article-title>. <source>Journal of Physiology-London</source>, <volume>145</volume>(<issue>2</issue>), <fpage>326</fpage>&#x2013;<lpage>335</lpage>. <comment>1079, 1290</comment></mixed-citation></ref>
<ref id="ref-60"><label>60.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Gershgorn</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2017</year>). <article-title>The data that transformed AI research&#x2014;and possibly the world.</article-title> <source>Quartz</source>, (<italic>Jul 26</italic>). <ext-link ext-link-type="uri" xlink:href="https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/">Original website</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20170804013913/https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/">Internet archive</ext-link> (<comment>blurry images</comment>). <comment>1079, 1081</comment></mixed-citation></ref>
<ref id="ref-61"><label>61.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>He</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Ren</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification</article-title>. <comment>CoRR, abs/1502.01852</comment>. <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1502.01852">arXiv:1502.01852</ext-link>, <monospace>1502.01852</monospace>. <comment>1080, 1108, 1138, 1274, 1288</comment></mixed-citation></ref>
<ref id="ref-62"><label>62.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Russakovsky</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Deng</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Krause</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Satheesh</surname>, <given-names>S.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2015</year>). <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>. <source>International Journal of Computer Vision</source>, <volume>115</volume>(<issue>3</issue>), <fpage>211</fpage>&#x2013;<lpage>252</lpage>. <comment>1080, 1081</comment></mixed-citation></ref>
<ref id="ref-63"><label>63.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Park</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Russakovsky</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Deng</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>F.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2017</year>). <article-title>ImageNet Large scale visual recognition challenge (ILSVRC) 2017, Overview</article-title>. <source>ILSVRC 2017</source>, (<italic>Jul 26</italic>). <ext-link ext-link-type="uri" xlink:href="http://image-net.org/challenges/talks_2017/ILSVRC2017_overview.pdf">Original website</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20180306152853/http://image-net.org/challenges/talks_2017/ILSVRC2017_overview.pdf">Internet archive</ext-link>. <comment>1080, 1081</comment></mixed-citation></ref>
<ref id="ref-64"><label>64.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Beckwith</surname>, <given-names>W.</given-names></string-name></person-group> <article-title>Science&#x2019;s 2021 Breakthrough: AI-powered Protein Prediction</article-title>. <comment>2022 Dec 17</comment>, <ext-link ext-link-type="uri" xlink:href="https://www.aaas.org/news/sciences-2021-breakthrough-ai-powered-protein-prediction">Original website</ext-link>. <comment>1079, 1080</comment></mixed-citation></ref>
<ref id="ref-65"><label>65.</label><mixed-citation publication-type="web"><article-title>AlphaFold reveals the structure of the protein universe</article-title>. <ext-link ext-link-type="uri" xlink:href="https://www.deepmind.com/">DeepMind</ext-link>, <comment>2022 Jul 28</comment>, <ext-link ext-link-type="uri" xlink:href="https://www.deepmind.com/blog/alphafold-reveals-the-structure-of-the-protein-universe">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220826234657/https://www.deepmind.com/blog/alphafold-reveals-the-structure-of-the-protein-universe">Internet archive</ext-link>. <comment>1080</comment></mixed-citation></ref>
<ref id="ref-66"><label>66.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Callaway</surname>, <given-names>E.</given-names></string-name></person-group> <article-title>DeepMind&#x2019;s AI predicts structures for a vast trove of proteins</article-title>. <comment>2021 Jul 21</comment>, <ext-link ext-link-type="uri" xlink:href="https://www.nature.com/articles/d41586-021-02025-4">Original website</ext-link>. <comment>1080</comment></mixed-citation></ref>
<ref id="ref-67"><label>67.</label><mixed-citation publication-type="web"><collab>Editorial</collab> (<year>2019</year>). <article-title>The Guardian view on the future of AI: Great power, great irresponsibility</article-title>. <source>The Guardian</source>, (<italic>Jan 01</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.theguardian.com/commentisfree/2019/jan/01/the-guardian-view-on-the-future-of-ai-great-power-great-irresponsibility">Original website</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20190109074917/https://www.theguardian.com/commentisfree/2019/jan/01/the-guardian-view-on-the-future-of-ai-great-power-great-irresponsibility">Internet archive</ext-link>. <comment>1080, 1304, 1305</comment></mixed-citation></ref>
<ref id="ref-68"><label>68.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Silver</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Hubert</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Schrittwieser</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Antonoglou</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Lai</surname>, <given-names>M.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2018</year>). <article-title>A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play</article-title>. <source>Science</source>, <volume>362</volume>(<issue>6419</issue>), <fpage>1140</fpage>&#x2013;<lpage>1144</lpage>. <comment>1080</comment>; <pub-id pub-id-type="pmid">30523106</pub-id></mixed-citation></ref>
<ref id="ref-69"><label>69.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mnih</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Kavukcuoglu</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Silver</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Rusu</surname>, <given-names>A. A.</given-names></string-name>, <string-name><surname>Veness</surname>, <given-names>J.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2015</year>). <article-title>Human-level control through deep reinforcement learning</article-title>. <source>Nature</source>, <volume>518</volume>(<issue>7540</issue>), <fpage>529</fpage>&#x2013;<lpage>533</lpage>. <comment>1081</comment>; <pub-id pub-id-type="pmid">25719670</pub-id></mixed-citation></ref>
<ref id="ref-70"><label>70.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Racaniere</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Weber</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Reichert</surname>, <given-names>D. P.</given-names></string-name>, <string-name><surname>Buesing</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Guez</surname>, <given-names>A.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2017</year>). <article-title>Imagination-Augmented Agents for Deep Reinforcement Learning</article-title>. In <person-group person-group-type="editor"><string-name><surname>Guyon</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Luxburg</surname>, <given-names>U. V.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Wallach</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Fergus</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Vishwanathan</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Garnett</surname>, <given-names>R.</given-names></string-name></person-group> (eds.), <source>Advances in Neural Information Processing Systems 30 (NIPS 2017)</source>. <comment>1081</comment></mixed-citation></ref>
<ref id="ref-71"><label>71.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Silver</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Schrittwieser</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Simonyan</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Antonoglou</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>A.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2017</year>). <article-title>Mastering the game of Go without human knowledge</article-title>. <source>Nature</source>, <volume>550</volume>(<issue>7676</issue>), <fpage>354</fpage>&#x2013;<lpage>359</lpage>. <comment>1081</comment>; <pub-id pub-id-type="pmid">29052630</pub-id></mixed-citation></ref>
<ref id="ref-72"><label>72.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Cellan-Jones</surname>, <given-names>Rory</given-names></string-name></person-group> (<year>2017</year>). <article-title>Artificial intelligence - hype, hope and fear</article-title>. <source>BBC</source>, (<italic>Oct 16</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.bbc.com/news/technology-41634316">Original website</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20180723132129/https://www.bbc.com/news/technology-41634316">Internet archive</ext-link>. <comment>1081</comment></mixed-citation></ref>
<ref id="ref-73"><label>73.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Campbell</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Mastering board games. A single algorithm can learn to play three hard board games</article-title>. <source>Science</source>, <volume>362</volume>(<issue>6419</issue>), <fpage>1118</fpage>. <comment>1081</comment>; <pub-id pub-id-type="pmid">30523099</pub-id></mixed-citation></ref>
<ref id="ref-74"><label>74.</label><mixed-citation publication-type="web"><collab>The Economist</collab> (<year>2016</year>). <article-title>Why artificial intelligence is enjoying a renaissance</article-title>. (<italic>Jul 15</italic>). (<ext-link ext-link-type="uri" xlink:href="https://goo.gl/Grkofq">https://goo.gl/Grkofq</ext-link>). <comment>1081, 1122, 1294</comment></mixed-citation></ref>
<ref id="ref-75"><label>75.</label><mixed-citation publication-type="web"><collab>The Economist</collab> (<year>2016</year>). <article-title>From not working to neural networking</article-title>. (<italic>Jun 25</italic>). (<ext-link ext-link-type="uri" xlink:href="https://goo.gl/z1c9pc">https://goo.gl/z1c9pc</ext-link>). <comment>1081, 1120, 1122, 1294</comment></mixed-citation></ref>
<ref id="ref-76"><label>76.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Dodge</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Karam</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2017</year>). <article-title>A Study and Comparison of Human and Deep Learning Recognition Performance Under Visual Distortions</article-title>. (<italic>May 6</italic>). <comment>CoRR (Computing Research Repository)</comment>, <comment>abs/1705.02498.<xref ref-type="fn" rid="fn337">337</xref><fn id="fn337"><label>337</label><p>Several computer-science papers cite papers in the Computing Research Repository (CoRR) in a form that can be cryptic to outsiders, e.g., &#x201C;CoRR abs/1706.03762v5&#x201D;. This means that the abstract of paper number &#x201C;1706.03762v5&#x201D; (version 5) can be accessed by prepending the CoRR web address <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/">https://arxiv.org/</ext-link> to &#x201C;abs/1706.03762v5&#x201D; to form <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1706.03762v5">https://arxiv.org/abs/1706.03762v5</ext-link>, a page that can also be found via a web search of &#x201C;abs/1706.03762v5&#x201D; and from which the PDF of the paper can be downloaded. An equivalent reference is &#x201C;arXiv preprint arXiv:1706.03762v5&#x201D;, which may be clearer since more non-computer-science readers would have heard of the arXiv than of the CoRR. Papers such as [31] use both types of references, which are also used in the present review paper so that readers become familiar with both.
To refer specifically to version 5, use &#x201C;CoRR abs/1706.03762v5&#x201D;; to refer to the latest version (which may differ from version 5), drop &#x201C;v5&#x201D; and use only &#x201C;CoRR abs/1706.03762&#x201D;.</p></fn> <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1705.02498">arXiv:1705.02498</ext-link>. 1081</comment></mixed-citation></ref>
<ref id="ref-77"><label>77.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Hardesty</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Explained: Neural networks</article-title>. <source>MIT News</source>, (<italic>Apr 14</italic>). <ext-link ext-link-type="uri" xlink:href="http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414">Original website</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20181110195900/http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414">Internet archive</ext-link>. <comment>1081, 1278</comment></mixed-citation></ref>
<ref id="ref-78"><label>78.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Goodfellow</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Courville</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2016</year>). <source>Deep Learning</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>The MIT Press</publisher-name>. <comment>1082, 1084, 1085, 1095, 1100, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1112, 1114, 1115, 1116, 1117, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1133, 1135, 1137, 1138, 1140, 1141, 1143, 1144, 1145, 1146, 1152, 1153, 1154, 1155, 1157, 1158, 1159, 1160, 1161, 1167, 1168, 1170, 1172, 1174, 1176, 1177, 1180, 1182, 1183, 1194, 1195, 1196, 1197, 1199, 1202, 1203, 1204, 1265, 1277, 1278, 1280, 1281, 1284, 1288, 1289, 1290, 1291, 1293, 1334, 1335, 1336, 1339, 1340, 1343</comment></mixed-citation></ref>
<ref id="ref-79"><label>79.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Ford</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2018</year>). <source>Architects of Intelligence: The truth about AI from the people building it</source>. <publisher-name>Packt Publishing</publisher-name>. <comment>1082, 1084, 1289, 1291, 1292, 1293, 1303</comment></mixed-citation></ref>
<ref id="ref-80"><label>80.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bottou</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Curtis</surname>, <given-names>F. E.</given-names></string-name>, <string-name><surname>Nocedal</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Optimization Methods for Large-Scale Machine Learning</article-title>. <source>SIAM Review</source>, <volume>60</volume>(<issue>2</issue>), <fpage>223</fpage>&#x2013;<lpage>311</lpage>. <comment>1082, 1144, 1146, 1152, 1153, 1155, 1161, 1174, 1176, 1177</comment></mixed-citation></ref>
<ref id="ref-81"><label>81.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Khullar</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2019</year>). <article-title>A.I. Could Worsen Health Disparities</article-title>. <source>New York Times</source>, (<italic>Jan 31</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2019/01/31/opinion/ai-bias-healthcare.html">Original website</ext-link>. <comment>1082</comment></mixed-citation></ref>
<ref id="ref-82"><label>82.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Kornfield</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Firozi</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Artificial intelligence use is growing in the U.S. healthcare system</article-title>. <source>Washington Post</source>, (<italic>Feb 24</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.washingtonpost.com/news/powerpost/paloma/the-health-202/2020/02/24/the-health-202-artificial-intelligence-use-is-growing-in-the-u-s-health-care-system/5e52f13188e0fa632ba81ec7/">Original website</ext-link>. <comment>1082</comment></mixed-citation></ref>
<ref id="ref-83"><label>83.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2018a</year>). <source>AI Superpowers: China, Silicon Valley, and the New World Order</source>. <publisher-name>Houghton Mifflin Harcourt</publisher-name>. <comment>1082</comment></mixed-citation></ref>
<ref id="ref-84"><label>84.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2018b</year>). <article-title>How AI can save our humanity</article-title>. <source>TED2018</source>, (<italic>Apr</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.ted.com/talks/kai_fu_lee_how_ai_can_save_our_humanity">Original website</ext-link>. <comment>1082</comment></mixed-citation></ref>
<ref id="ref-85"><label>85.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dunjko</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Briegel</surname>, <given-names>H. J.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Machine learning &#x0026; artificial intelligence in the quantum domain: a review of recent progress</article-title>. <source>Reports on Progress in Physics</source>, <volume>81</volume>(<issue>7</issue>), <fpage>074001</fpage>. <comment>1084, 1085</comment>; <pub-id pub-id-type="pmid">29504942</pub-id></mixed-citation></ref>
<ref id="ref-86"><label>86.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hinton</surname>, <given-names>G. E.</given-names></string-name>, <string-name><surname>Osindero</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Teh</surname>, <given-names>Y. W.</given-names></string-name></person-group> (<year>2006</year>). <article-title>A fast learning algorithm for deep belief nets</article-title>. <source>Neural Computation</source>, <volume>18</volume>(<issue>7</issue>), <fpage>1527</fpage>&#x2013;<lpage>1554</lpage>. <comment>1084</comment>; <pub-id pub-id-type="pmid">16764513</pub-id></mixed-citation></ref>
<ref id="ref-87"><label>87.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Merolla</surname>, <given-names>P. A.</given-names></string-name>, <string-name><surname>Arthur</surname>, <given-names>J. V.</given-names></string-name>, <string-name><surname>Alvarez-Icaza</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Cassidy</surname>, <given-names>A. S.</given-names></string-name>, <string-name><surname>Sawada</surname>, <given-names>J.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2014</year>). <article-title>A million spiking-neuron integrated circuit with a scalable communication network and interface</article-title>. <source>Science</source>, <volume>345</volume>(<issue>6197</issue>), <fpage>668</fpage>&#x2013;<lpage>673</lpage>. <comment>1085</comment>; <pub-id pub-id-type="pmid">25104385</pub-id></mixed-citation></ref>
<ref id="ref-88"><label>88.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Esser</surname>, <given-names>S. K.</given-names></string-name>, <string-name><surname>Merolla</surname>, <given-names>P. A.</given-names></string-name>, <string-name><surname>Arthur</surname>, <given-names>J. V.</given-names></string-name>, <string-name><surname>Cassidy</surname>, <given-names>A. S.</given-names></string-name>, <string-name><surname>Appuswamy</surname>, <given-names>R.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2016</year>). <article-title>Convolutional networks for fast, energy-efficient neuromorphic computing</article-title>. <source>Proceedings of the National Academy of Sciences of the United States of America</source>, <volume>113</volume>(<issue>41</issue>), <fpage>11441</fpage>&#x2013;<lpage>11446</lpage>. <comment>1085</comment>; <pub-id pub-id-type="pmid">27651489</pub-id></mixed-citation></ref>
<ref id="ref-89"><label>89.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Warren</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Root</surname>, <given-names>P.</given-names></string-name></person-group> (<year>1963</year>). <article-title>The behavior of naturally fractured reservoirs</article-title>. <source>Society of Petroleum Engineers Journal</source>, <volume>3</volume>(<issue>03</issue>), <fpage>245</fpage>&#x2013;<lpage>255</lpage>. <comment>1090</comment></mixed-citation></ref>
<ref id="ref-90"><label>90.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ji</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Hall</surname>, <given-names>S. A.</given-names></string-name>, <string-name><surname>Baud</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Wong</surname>, <given-names>T. F.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Characterization of pore structure and strain localization in Majella limestone by X-ray computed tomography and digital image correlation</article-title>. <source>Geophysical Journal International</source>, <volume>200</volume>(<issue>2</issue>), <fpage>701</fpage>&#x2013;<lpage>719</lpage>. <comment>1091, 1092</comment></mixed-citation></ref>
<ref id="ref-91"><label>91.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Christensen</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2013</year>). <source>The Theory of Materials Failure</source>. <edition>1st edition</edition>. <publisher-name>Oxford University Press</publisher-name>. <comment>1090</comment></mixed-citation></ref>
<ref id="ref-92"><label>92.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Balogun</surname>, <given-names>A. S.</given-names></string-name>, <string-name><surname>Kazemi</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Ozkan</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Al-Kobaisi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Ramirez</surname>, <given-names>B. A.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2007</year>). <article-title>Verification and proper use of water-oil transfer function for dual-porosity and dual-permeability reservoirs</article-title>. In <source>SPE Middle East Oil and Gas Show and Conference</source>. <publisher-name>Society of Petroleum Engineers</publisher-name>. <comment>1090</comment></mixed-citation></ref>
<ref id="ref-93"><label>93.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ho</surname>, <given-names>C. K.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Dual porosity vs. dual permeability models of matrix diffusion in fractured rock</article-title>. <comment>Technical report</comment>. <conf-name>International High-Level Radioactive Waste Conference</conf-name>, <conf-loc>Las Vegas, NV (US)</conf-loc>, <conf-date>04/29/2001-05/03/2001</conf-date>. <comment>Sandia National Laboratories, Albuquerque, NM (US), Report No. SAND2000-2336C. Office of Scientific &#x0026; Technical Information Report Number 763324</comment>. <ext-link ext-link-type="uri" xlink:href="https://inis.iaea.org/collection/NCLCollectionStore/_Public/32/026/32026591.pdf">PDF archived at the International Atomic Energy Agency</ext-link>. <comment>1090, 1091</comment></mixed-citation></ref>
<ref id="ref-94"><label>94.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Datta-Gupta</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>King</surname>, <given-names>M. J.</given-names></string-name></person-group> (<year>2007</year>). <source>Streamline simulation: Theory and practice</source>, <comment>volume 11</comment>. <publisher-loc>Richardson</publisher-loc>: <publisher-name>Society of Petroleum Engineers</publisher-name>. <comment>1090, 1091, 1092</comment></mixed-citation></ref>
<ref id="ref-95"><label>95.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Croiz&#x00E9;</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Renard</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Gratier</surname>, <given-names>J. P.</given-names></string-name></person-group> (<year>2013</year>). <chapter-title>Chapter 3 - compaction and porosity reduction in carbonates: A review of observations, theory, and experiments</chapter-title>. In <person-group person-group-type="editor"><string-name><surname>Dmowska</surname>, <given-names>R.</given-names></string-name></person-group>, <comment>editor</comment>, <source>Advances in Geophysics</source>, <comment>volume 54</comment>. <publisher-name>Elsevier</publisher-name>, <fpage>181</fpage>&#x2013;<lpage>238</lpage>. <comment>1091, 1092</comment></mixed-citation></ref>
<ref id="ref-96"><label>96.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Qu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Rahman</surname>, <given-names>M. M.</given-names></string-name></person-group> (<year>2019</year>). <article-title>A new dual-permeability model for naturally fractured reservoirs</article-title>. <source>Special Topics &#x0026; Reviews in Porous Media: An International Journal</source>, <volume>10</volume>(<issue>5</issue>). <comment>1091</comment></mixed-citation></ref>
<ref id="ref-97"><label>97.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Gers</surname>, <given-names>F. A.</given-names></string-name>, <string-name><surname>Schmidhuber</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Recurrent nets that time and count</article-title>. In <source>Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks</source>. <publisher-name>IEEE</publisher-name>. <comment>1092</comment></mixed-citation></ref>
<ref id="ref-98"><label>98.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Santamarina</surname>, <given-names>J. C.</given-names></string-name></person-group> (<year>2003</year>). <article-title>Soil behavior at the microscale: particle forces</article-title>. In <source>Soil behavior and soft ground construction</source>. <fpage>25</fpage>&#x2013;<lpage>56</lpage>. <conf-name>Proc. of the Symposium in honor of Charles C. Ladd</conf-name>, <conf-date>October 2001</conf-date>, <publisher-name>MIT</publisher-name>. <comment>1095</comment></mixed-citation></ref>
<ref id="ref-99"><label>99.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Alam</surname>, <given-names>M. F.</given-names></string-name>, <string-name><surname>Haque</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ranjith</surname>, <given-names>P. G.</given-names></string-name></person-group> (<year>2018</year>). <article-title>A study of the particle-level fabric and morphology of granular soils under one-dimensional compression using insitu X-ray CT imaging</article-title>. <source>Materials</source>, <volume>11</volume>(<issue>6</issue>), <fpage>919</fpage>. <comment>1094</comment>; <pub-id pub-id-type="pmid">29844280</pub-id></mixed-citation></ref>
<ref id="ref-100"><label>100.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Karatza</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>And&#x00F2;</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Papanicolopulos</surname>, <given-names>S. A.</given-names></string-name>, <string-name><surname>Viggiani</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Ooi</surname>, <given-names>J. Y.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Effect of particle morphology and contacts on particle breakage in a granular assembly studied using X-ray tomography</article-title>. <source>Granular Matter</source>, <volume>21</volume>(<issue>3</issue>), <fpage>44</fpage>. <comment>1094</comment></mixed-citation></ref>
<ref id="ref-101"><label>101.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shire</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>O&#x2019;Sullivan</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Hanley</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Fannin</surname>, <given-names>R. J.</given-names></string-name></person-group> (<year>2014</year>). <article-title>Fabric and effective stress distribution in internally unstable soils</article-title>. <source>Journal of Geotechnical and Geoenvironmental Engineering</source>, <volume>140</volume>(<issue>12</issue>), <fpage>04014072</fpage>. <comment>1094</comment></mixed-citation></ref>
<ref id="ref-102"><label>102.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kanatani</surname>, <given-names>K. I.</given-names></string-name></person-group> (<year>1984</year>). <article-title>Distribution of directional data and fabric tensors</article-title>. <source>International Journal of Engineering Science</source>, <volume>22</volume>(<issue>2</issue>), <fpage>149</fpage>&#x2013;<lpage>164</lpage>. <comment>1094, 1242</comment></mixed-citation></ref>
<ref id="ref-103"><label>103.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fu</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Dafalias</surname>, <given-names>Y. F.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Relationship between void- and contact normal-based fabric tensors for 2D idealized granular materials</article-title>. <source>International Journal of Solids and Structures</source>, <volume>63</volume>, <fpage>68</fpage>&#x2013;<lpage>81</lpage>. <comment>1094</comment></mixed-citation></ref>
<ref id="ref-104"><label>104.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Graves</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Schmidhuber</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2005</year>). <article-title>Framewise phoneme classification with bidirectional LSTM and other neural network architectures</article-title>. <source>Neural Networks</source>, <volume>18</volume>(<issue>5&#x2013;6</issue>), <fpage>602</fpage>&#x2013;<lpage>610</lpage>. <comment>1097</comment>; <pub-id pub-id-type="pmid">16112549</pub-id></mixed-citation></ref>
<ref id="ref-105"><label>105.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Graham</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Kanov</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Malaya</surname>, <given-names>N.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2016</year>). <article-title>A web services accessible database of turbulent channel flow and its use for testing a new integral wall model for LES</article-title>. <source>Journal of Turbulence</source>, <volume>17</volume>(<issue>2</issue>), <fpage>181</fpage>&#x2013;<lpage>215</lpage>. <comment>1097, 1256</comment></mixed-citation></ref>
<ref id="ref-106"><label>106.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Rossant</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Goodman</surname>, <given-names>D. F. M.</given-names></string-name>, <string-name><surname>Fontaine</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Platkiewicz</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Magnusson</surname>, <given-names>A. K.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2011</year>). <article-title>Fitting neuron models to spike trains</article-title>. <source>Frontiers in Neuroscience</source>, <italic>Feb 23</italic>. <comment>1099</comment></mixed-citation></ref>
<ref id="ref-107"><label>107.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Brillouin</surname>, <given-names>L.</given-names></string-name></person-group> (<year>1964</year>). <source>Tensors in Mechanics and Elasticity</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Academic Press</publisher-name>. <comment>1100, 1102</comment></mixed-citation></ref>
<ref id="ref-108"><label>108.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Misner</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Thorne</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Wheeler</surname>, <given-names>J.</given-names></string-name></person-group> (<year>1973</year>). <source>Gravitation</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>W.H. Freeman and Company</publisher-name>. <comment>1100</comment></mixed-citation></ref>
<ref id="ref-109"><label>109.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Malvern</surname>, <given-names>L.</given-names></string-name></person-group> (<year>1969</year>). <source>Introduction to the Mechanics of a Continuous Medium</source>. <publisher-loc>Englewood Cliffs, New Jersey</publisher-loc>: <publisher-name>Prentice Hall</publisher-name>. <comment>1102</comment></mixed-citation></ref>
<ref id="ref-110"><label>110.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Marsden</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hughes</surname>, <given-names>T.</given-names></string-name></person-group> (<year>1994</year>). <source>Mathematical Foundations of Elasticity</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Dover</publisher-name>. <comment>1102</comment></mixed-citation></ref>
<ref id="ref-111"><label>111.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>S.</given-names></string-name></person-group> (<year>1995</year>). <article-title>Dynamics of sliding geometrically-exact beams - large-angle maneuver and parametric resonance</article-title>. <source>Computer Methods in Applied Mechanics and Engineering</source>, <volume>120</volume>(<issue>1-2</issue>), <fpage>65</fpage>&#x2013;<lpage>118</lpage>. <comment>1102</comment></mixed-citation></ref>
<ref id="ref-112"><label>112.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Werbos</surname>, <given-names>P.</given-names></string-name></person-group> (<year>1988</year>). <article-title>Backpropagation: Past and future</article-title>. <conf-name>IEEE 1988 International Conference on Neural Networks</conf-name>, <conf-loc>San Diego</conf-loc>, <conf-date>24-27 July 1988</conf-date>. <comment>1106, 1293</comment></mixed-citation></ref>
<ref id="ref-113"><label>113.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Glorot</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Bordes</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2011</year>). <article-title>Deep Sparse Rectifier Neural Networks</article-title>. <source>Proceedings of Machine Learning Research (PMLR), Vol.15, Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 11-13 April 2011, Fort Lauderdale, FL, USA</source>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20180412023920/http://proceedings.mlr.press/v15/">PMLR Vol.15, AISTATS 2011</ext-link> <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20180328160352/http://proceedings.mlr.press:80/v15/glorot11a/glorot11a.pdf">Paper pdf</ext-link>. <comment>1107, 1109, 1111, 1112, 1137, 1138, 1289</comment></mixed-citation></ref>
<ref id="ref-114"><label>114.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Drion</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>O&#x2019;Leary</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Marder</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Ion channel degeneracy enables robust and tunable neuronal firing rates</article-title>. <source>Proceedings of the National Academy of Sciences of the United States of America</source>, <volume>112</volume>(<issue>38</issue>), <fpage>E5361</fpage>&#x2013;<lpage>E5370</lpage>. <comment>1108, 1110</comment>; <pub-id pub-id-type="pmid">26354124</pub-id></mixed-citation></ref>
<ref id="ref-115"><label>115.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>van Welie</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>van Hooft</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wadman</surname>, <given-names>W.</given-names></string-name></person-group> (<year>2004</year>). <article-title>Homeostatic scaling of neuronal excitability by synaptic modulation of somatic hyperpolarization-activated I-h channels</article-title>. <source>Proceedings of the National Academy of Sciences of the United States of America</source>, <volume>101</volume>(<issue>14</issue>), <fpage>5123</fpage>&#x2013;<lpage>5128</lpage>. <comment>1111</comment>; <pub-id pub-id-type="pmid">15051886</pub-id></mixed-citation></ref>
<ref id="ref-116"><label>116.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Steyn-Ross</surname>, <given-names>M. L.</given-names></string-name>, <string-name><surname>Steyn-Ross</surname>, <given-names>D. A.</given-names></string-name></person-group> (<year>2016</year>). <article-title>From individual spiking neurons to population behavior: Systematic elimination of short-wavelength spatial modes</article-title>. <source>Physical Review E</source>, <volume>93</volume>(<issue>2</issue>). <comment>1111, 1286</comment></mixed-citation></ref>
<ref id="ref-117"><label>117.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dutta</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Shukla</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Mohapatra</surname>, <given-names>N. R.</given-names></string-name>, <string-name><surname>Ganguly</surname>, <given-names>U.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Leaky Integrate and Fire Neuron by Charge-Discharge Dynamics in Floating-Body MOSFET</article-title>. <source>Scientific Reports</source>, <volume>7</volume>. <comment>1111</comment></mixed-citation></ref>
<ref id="ref-118"><label>118.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wilson</surname>, <given-names>H.</given-names></string-name></person-group> (<year>1999</year>). <article-title>Simplified dynamics of human and mammalian neocortical neurons</article-title>. <source>Journal of Theoretical Biology</source>, <volume>200</volume>(<issue>4</issue>), <fpage>375</fpage>&#x2013;<lpage>388</lpage>. <comment>1111, 1285, 1286</comment>; <pub-id pub-id-type="pmid">10525397</pub-id></mixed-citation></ref>
<ref id="ref-119"><label>119.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rosenblatt</surname>, <given-names>F.</given-names></string-name></person-group> (<year>1958</year>). <article-title>The perceptron - A probabilistic model for information-storage and organization in the brain</article-title>. <source>Psychological Review</source>, <volume>65</volume>(<issue>6</issue>), <fpage>386</fpage>&#x2013;<lpage>408</lpage>. <comment>1113, 1114, 1116, 1117, 1122, 1123, 1277, 1278, 1280, 1281, 1282, 1283</comment>; <pub-id pub-id-type="pmid">13602029</pub-id></mixed-citation></ref>
<ref id="ref-120"><label>120.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Block</surname>, <given-names>H.</given-names></string-name></person-group> (<year>1962a</year>). <article-title>Perceptron - A model for brain functioning. I</article-title>. <source>Reviews of Modern Physics</source>, <volume>34</volume>(<issue>1</issue>), <fpage>123</fpage>&#x2013;<lpage>135</lpage>. <comment>1113, 1114, 1278, 1281, 1282, 1283</comment></mixed-citation></ref>
<ref id="ref-121"><label>121.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Minsky</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Papert</surname>, <given-names>S.</given-names></string-name></person-group> (<year>1969</year>). <source>Perceptrons: An introduction to computational geometry</source>. <publisher-name>MIT Press</publisher-name>. <comment>1988 expanded edition. 2017 edition with foreword by Leon Bottou, Facebook AI</comment>. <comment>1114, 1115, 1281, 1282, 1283</comment></mixed-citation></ref>
<ref id="ref-122"><label>122.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Herzberger</surname>, <given-names>M.</given-names></string-name></person-group> (<year>1949</year>). <article-title>The normal equations of the method of least squares and their solution</article-title>. <source>Quarterly of Applied Mathematics</source>, <volume>7</volume>(<issue>2</issue>), <fpage>217</fpage>&#x2013;<lpage>223</lpage>. (<ext-link ext-link-type="uri" xlink:href="https://www.ams.org/journals/qam/1949-07-02/S0033-569X-1949-30815-5/S0033-569X-1949-30815-5.pdf">pdf</ext-link>). <comment>1116</comment></mixed-citation></ref>
<ref id="ref-123"><label>123.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Weisstein</surname>, <given-names>E. W.</given-names></string-name></person-group> <article-title>Normal equation</article-title>. <comment>From</comment> <ext-link ext-link-type="uri" xlink:href="http://mathworld.wolfram.com/">MathWorld</ext-link><comment>&#x2013;A Wolfram Web Resource. URL:</comment> <ext-link ext-link-type="uri" xlink:href="http://mathworld.wolfram.com/NormalEquation.html">http://mathworld.wolfram.com/NormalEquation.html</ext-link>. <comment>1116</comment></mixed-citation></ref>
<ref id="ref-124"><label>124.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dyson</surname>, <given-names>F.</given-names></string-name></person-group> (<year>2004</year>). <article-title>A meeting with Enrico Fermi - How one intuitive physicist rescued a team from fruitless research</article-title>. <source>Nature</source>, <volume>427</volume>(<issue>6972</issue>), <fpage>297</fpage>. <comment>1120</comment>; <pub-id pub-id-type="pmid">14737148</pub-id></mixed-citation></ref>
<ref id="ref-125"><label>125.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mayer</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Khairy</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Howard</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2010</year>). <article-title>Drawing an elephant with four complex parameters</article-title>. <source>American Journal of Physics</source>, <volume>78</volume>(<issue>6</issue>), <fpage>648</fpage>&#x2013;<lpage>649</lpage>. <comment>1120</comment></mixed-citation></ref>
<ref id="ref-126"><label>126.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Hsu</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Biggest Neural Network Ever Pushes AI Deep Learning</article-title>. <source>IEEE Spectrum</source>. <comment>1122</comment></mixed-citation></ref>
<ref id="ref-127"><label>127.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>He</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Ren</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Deep Residual Learning for Image Recognition</article-title>. <comment>CoRR (Computing Research Repository), abs/1512.03385v1. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1512.03385v1">arXiv:1512.03385v1</ext-link>. See Footnote <xref ref-type="fn" rid="fn337">337</xref>. 1123, 1124, 1125, 1135, 1168, 1170</comment></mixed-citation></ref>
<ref id="ref-128"><label>128.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Sedra</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Weinberger</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2016</year>). <article-title>Deep Networks with Stochastic Depth</article-title>. <comment>CoRR (Computing Research Repository), abs/1603.09382v3. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1603.09382v3">https://arxiv.org/abs/1603.09382v3</ext-link>. See Footnote <xref ref-type="fn" rid="fn337">337</xref>. 1124</comment></mixed-citation></ref>
<ref id="ref-129"><label>129.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zagoruyko</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Komodakis</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Wide residual networks</article-title>. (<italic>Jun 17</italic>). <source>CoRR (Computing Research Repository)</source>, <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1605.07146v4">arXiv:1605.07146v4</ext-link>. <comment>1124</comment></mixed-citation></ref>
<ref id="ref-130"><label>130.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Bishop</surname>, <given-names>C. M.</given-names></string-name></person-group> (<year>2006</year>). <source>Pattern Recognition and Machine Learning</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer Science+Business Media</publisher-name>. <comment>1127, 1128, 1129, 1130, 1140, 1211, 1215, 1216, 1217, 1218, 1219, 1337, 1338, 1339</comment></mixed-citation></ref>
<ref id="ref-131"><label>131.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Maas</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Hannun</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2013</year>). <article-title>Rectifier nonlinearities improve neural network acoustic models</article-title>. <ext-link ext-link-type="uri" xlink:href="https://sites.google.com/site/deeplearningicml2013/">ICML Workshop on Deep Learning for Audio, Speech, and Language Processing (WDLASL 2013)</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://sites.google.com/site/deeplearningicml2013/accepted_papers">Accepted papers</ext-link>. <comment>See also</comment> <ext-link ext-link-type="uri" xlink:href="https://www.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.leakyrelulayer.html">leakyReluLayer, Leaky Rectified Linear Unit (ReLU) layer, MathWorks</ext-link>. <comment>1138</comment></mixed-citation></ref>
<ref id="ref-132"><label>132.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Li</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Taylor</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Studer</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Goldstein</surname>, <given-names>T.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Visualizing the loss landscape of neural nets</article-title>. (<italic>Nov 7</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1712.09913v3">arXiv:1712.09913v3</ext-link>. <comment>1139</comment></mixed-citation></ref>
<ref id="ref-133"><label>133.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Geman</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Bienenstock</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Doursat</surname>, <given-names>R.</given-names></string-name></person-group> (<year>1992</year>). <article-title>Neural networks and the bias/variance dilemma</article-title>. <source>Neural Computation</source>, <volume>4</volume>(<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>58</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.dam.brown.edu/people/documents/bias-variance.pdf">pdf</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.653.6987&#x0026;rep=rep1&#x0026;type=pdf">pdf</ext-link>. <comment>1140, 1142</comment></mixed-citation></ref>
<ref id="ref-134"><label>134.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Hastie</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Friedman</surname>, <given-names>J. H.</given-names></string-name></person-group> (<year>2001</year>). <source>The elements of statistical learning: Data mining, inference, prediction</source>. <edition>1st edition</edition>. <publisher-name>Springer</publisher-name>. <edition>2nd edition</edition>, <comment>corrected, 12th printing, 2017 Jan 13</comment>. <comment>1140</comment></mixed-citation></ref>
<ref id="ref-135"><label>135.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Prechelt</surname>, <given-names>L.</given-names></string-name></person-group> (<year>1998</year>). <chapter-title>Early Stopping&#x2014;But When?</chapter-title> In <person-group person-group-type="editor"><string-name><given-names>G.</given-names> <surname>Orr</surname></string-name></person-group>, <person-group person-group-type="editor"><string-name><given-names>K.</given-names> <surname>Muller</surname></string-name></person-group>. <source>Neural Networks: Tricks of the Trade</source>. <publisher-name>Springer</publisher-name>. <comment>LNCS State-of-the-Art Survey</comment>. <ext-link ext-link-type="uri" xlink:href="http://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf">Paper pdf</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20190214201939/http://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf">Internet archive</ext-link>. <comment>1141, 1142, 1143</comment></mixed-citation></ref>
<ref id="ref-136"><label>136.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Belkin</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Hsu</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Mandal</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Reconciling modern machine-learning practice and the classical bias&#x2013;variance trade-off</article-title>. <source>Proceedings of the National Academy of Sciences</source>, <volume>116</volume>(<issue>32</issue>), <fpage>15849</fpage>&#x2013;<lpage>15854</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1073/pnas.1903070116">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1812.11118">arXiv:1812.11118</ext-link>. <comment>1143, 1144, 1145</comment></mixed-citation></ref>
<ref id="ref-137"><label>137.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Geiger</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Jacot</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Spigler</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Gabriel</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Sagun</surname>, <given-names>L.</given-names></string-name>, <etal>et al</etal>.</person-group> (<year>2020</year>). <article-title>Scaling description of generalization with number of parameters in deep learning</article-title>. <source>Journal of Statistical Mechanics: Theory and Experiment</source>, <volume>2020</volume>(<issue>2</issue>), <fpage>023401</fpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1088%2F1742-5468%2Fab633c">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1901.01608">arXiv:1901.01608</ext-link>. <comment>1143, 1145</comment></mixed-citation></ref>
<ref id="ref-138"><label>138.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Sampaio</surname>, <given-names>P. R.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Deft-funnel: an open-source global optimization solver for constrained greybox and black-box problems</article-title>. (<italic>Jan 2020</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1912.12637">arXiv:1912.12637</ext-link>. <comment>1144</comment></mixed-citation></ref>
<ref id="ref-139"><label>139.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Polak</surname>, <given-names>E.</given-names></string-name></person-group> (<year>1971</year>). <source>Computational Methods in Optimization: A Unified Approach</source>. <publisher-name>Academic Press</publisher-name>. <comment>1146, 1147, 1148, 1149, 1150, 1151, 1152, 1158, 1189</comment></mixed-citation></ref>
<ref id="ref-140"><label>140.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lewis</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Torczon</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Trosset</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Direct search methods: then and now</article-title>. <source>Journal of Computational and Applied Mathematics</source>, <volume>124</volume> (<issue>1-2</issue>), <fpage>191</fpage>&#x2013;<lpage>207</lpage>. <comment>1146, 1152</comment></mixed-citation></ref>
<ref id="ref-141"><label>141.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kolda</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Lewis</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Torczon</surname>, <given-names>V.</given-names></string-name></person-group> (<year>2003</year>). <article-title>Optimization by direct search: New perspectives on some classical and modern methods</article-title>. <source>SIAM Review</source>, <volume>45</volume>(<issue>3</issue>), <fpage>385</fpage>&#x2013;<lpage>482</lpage>. <comment>1146, 1152</comment></mixed-citation></ref>
<ref id="ref-142"><label>142.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Kafka</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Wilke</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Gradient-only line searches: An alternative to probabilistic line searches</article-title>. (<italic>Mar 22</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1903.09383">arXiv:1903.09383</ext-link>. <comment>1146, 1193</comment></mixed-citation></ref>
<ref id="ref-143"><label>143.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mahsereci</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Hennig</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Probabilistic line searches for stochastic optimization</article-title>. <source>Journal of Machine Learning Research</source>, <volume>18</volume>. <comment>Article No.1. Also, CoRR, abs/1703.10034v2, Jun 30</comment>. <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1703.10034v2">arXiv:1703.10034v2</ext-link>, <monospace>1703.10034</monospace>. <comment>1146, 1152, 1153, 1191</comment></mixed-citation></ref>
<ref id="ref-144"><label>144.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Paquette</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Scheinberg</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2018</year>). <article-title>A stochastic line search method with convergence rate analysis</article-title>. (<italic>Jul 20</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1807.07994v1">arXiv:1807.07994v1</ext-link>. <comment>1146, 1149, 1150, 1151, 1153, 1155, 1185, 1187, 1188, 1189</comment></mixed-citation></ref>
<ref id="ref-145"><label>145.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Bergou</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Diouane</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Kungurtsev</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Royer</surname>, <given-names>C. W.</given-names></string-name></person-group> (<year>2018</year>). <article-title>A subsampling line-search method with second-order results</article-title>. (<italic>Nov 21</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1810.07211v2">arXiv:1810.07211v2</ext-link>. <comment>1146, 1149, 1150, 1151, 1153, 1155, 1185, 1188, 1189, 1190, 1191, 1192</comment></mixed-citation></ref>
<ref id="ref-146"><label>146.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Wills</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Sch&#x00F6;n</surname>, <given-names>T.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Stochastic quasi-Newton with adaptive step lengths for large-scale problems</article-title>. (<italic>Feb 22</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1802.04310v1">arXiv:1802.04310v1</ext-link>. <comment>1146, 1188, 1191</comment></mixed-citation></ref>
<ref id="ref-147"><label>147.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mahsereci</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Hennig</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Probabilistic line searches for stochastic optimization</article-title>. <source>CoRR</source>, (<italic>Feb 10</italic>). abs/1502.02846. <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1502.02846">arXiv:1502.02846</ext-link>. <comment>1146, 1152</comment></mixed-citation></ref>
<ref id="ref-148"><label>148.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Luenberger</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Ye</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2016</year>). <source>Linear and Nonlinear Programming</source>. <comment>4th edition</comment>. <publisher-name>Springer</publisher-name>. <comment>1147, 1149, 1158</comment></mixed-citation></ref>
<ref id="ref-149"><label>149.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Polak</surname>, <given-names>E.</given-names></string-name></person-group> (<year>1997</year>). <source>Optimization: Algorithms and Consistent Approximations</source>. <publisher-name>Springer Verlag</publisher-name>. <comment>1147, 1148, 1149, 1152, 1158</comment></mixed-citation></ref>
<ref id="ref-150"><label>150.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Goldstein</surname>, <given-names>A.</given-names></string-name></person-group> (<year>1965</year>). <article-title>On steepest descent</article-title>. <source>SIAM Journal of Control, Series A</source>, <volume>3</volume>(<issue>1</issue>), <fpage>147</fpage>&#x2013;<lpage>151</lpage>. <comment>1147, 1148, 1149, 1152, 1153</comment></mixed-citation></ref>
<ref id="ref-151"><label>151.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Armijo</surname>, <given-names>L.</given-names></string-name></person-group> (<year>1966</year>). <article-title>Minimization of functions having Lipschitz continuous partial derivatives</article-title>. <source>Pacific Journal of Mathematics</source>, <volume>16</volume>(<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>3</lpage>. <comment>1147, 1148, 1149, 1153, 1189</comment></mixed-citation></ref>
<ref id="ref-152"><label>152.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wolfe</surname>, <given-names>P.</given-names></string-name></person-group> (<year>1969</year>). <article-title>Convergence conditions for ascent methods</article-title>. <source>SIAM Review</source>, <volume>11</volume>(<issue>2</issue>), <fpage>226</fpage>&#x2013;<lpage>235</lpage>. <comment>1147, 1149, 1152, 1153</comment></mixed-citation></ref>
<ref id="ref-153"><label>153.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wolfe</surname>, <given-names>P.</given-names></string-name></person-group> (<year>1971</year>). <article-title>Convergence conditions for ascent methods. II: Some corrections</article-title>. <source>SIAM Review</source>, <volume>13</volume>. <comment>1147, 1152</comment></mixed-citation></ref>
<ref id="ref-154"><label>154.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Goldstein</surname>, <given-names>A.</given-names></string-name></person-group> (<year>1967</year>). <source>Constructive Real Analysis</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Harper</publisher-name>. <comment>1147, 1152</comment></mixed-citation></ref>
<ref id="ref-155"><label>155.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Goldstein</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Price</surname>, <given-names>J.</given-names></string-name></person-group> (<year>1967</year>). <article-title>An effective algorithm for minimization</article-title>. <source>Numerische Mathematik</source>, <volume>10</volume>, <fpage>184</fpage>&#x2013;<lpage>189</lpage>. <comment>1147, 1148, 1149</comment></mixed-citation></ref>
<ref id="ref-156"><label>156.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Ortega</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Rheinboldt</surname>, <given-names>W.</given-names></string-name></person-group> (<year>1970</year>).  <source>Iterative Solution of Nonlinear Equations in Several Variables</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Academic Press</publisher-name>. <comment>Republished in 2000 by SIAM, Classics in Applied Mathematics, Vol.30</comment>. <comment>1147, 1148, 1149, 1158</comment></mixed-citation></ref>
<ref id="ref-157"><label>157.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Nocedal</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wright</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2006</year>).  <source>Numerical Optimization</source>. <publisher-name>Springer</publisher-name>. <edition>2nd edition</edition>. <comment>1149, 1158</comment></mixed-citation></ref>
<ref id="ref-158"><label>158.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bollapragada</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Byrd</surname>, <given-names>R. H.</given-names></string-name>, <string-name><surname>Nocedal</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Exact and inexact subsampled Newton methods for optimization</article-title>. <source>IMA Journal of Numerical Analysis</source>, <volume>39</volume>(<issue>2</issue>), <fpage>545</fpage>&#x2013;<lpage>578</lpage>. <comment>1149</comment></mixed-citation></ref>
<ref id="ref-159"><label>159.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Berahas</surname>, <given-names>A. S.</given-names></string-name>, <string-name><surname>Byrd</surname>, <given-names>R. H.</given-names></string-name>, <string-name><surname>Nocedal</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Derivative-free optimization of noisy functions via quasi-Newton methods</article-title>. <source>SIAM Journal on Optimization</source>, <volume>29</volume>(<issue>2</issue>), <fpage>965</fpage>&#x2013;<lpage>993</lpage>. <comment>1149</comment></mixed-citation></ref>
<ref id="ref-160"><label>160.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Larson</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Menickelly</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wild</surname>, <given-names>S. M.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Derivative-free optimization methods</article-title>. (<italic>Jun 25</italic>).  <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1904.11585v2">arXiv:1904.11585v2</ext-link>. <comment>1149</comment></mixed-citation></ref>
<ref id="ref-161"><label>161.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shi</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Shen</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2005</year>). <article-title>Step-size estimation for unconstrained optimization methods</article-title>. <source>Computational and Applied Mathematics</source>, <volume>24</volume>(<issue>3</issue>), <fpage>399</fpage>&#x2013;<lpage>416</lpage>. <comment>1152</comment></mixed-citation></ref>
<ref id="ref-162"><label>162.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Sun</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Cao</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Zhao</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2019</year>). <article-title>A survey of optimization methods from a machine learning perspective</article-title>. (<italic>Oct 23</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1906.06821v2">arXiv:1906.06821v2</ext-link>. <comment>1153, 1174, 1176, 1177</comment></mixed-citation></ref>
<ref id="ref-163"><label>163.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kirkpatrick</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Gelatt</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Vecchi</surname>, <given-names>M.</given-names></string-name></person-group> (<year>1983</year>). <article-title>Optimization by simulated annealing</article-title>. <source>Science</source>, <volume>220</volume>(<issue>4598</issue>), <fpage>671</fpage>&#x2013;<lpage>680</lpage>. <comment>1153, 1164, 1167</comment>; <pub-id pub-id-type="pmid">17813860</pub-id></mixed-citation></ref>
<ref id="ref-164"><label>164.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Smith</surname>, <given-names>S. L.</given-names></string-name>, <string-name><surname>Kindermans</surname>, <given-names>P. J.</given-names></string-name>, <string-name><surname>Ying</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Le</surname>, <given-names>Q. V.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Don&#x2019;t decay the learning rate, increase the batch size</article-title>. (<italic>Feb 2018</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1711.00489v2">arXiv:1711.00489v2</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=B1Yy1BxCZ">OpenReview</ext-link>. <comment>1153, 1161, 1163, 1164, 1165, 1166, 1182</comment></mixed-citation></ref>
<ref id="ref-165"><label>165.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Schraudolph</surname>, <given-names>N.</given-names></string-name></person-group> (<year>1998</year>). <chapter-title>Centering Neural Network Gradient Factors</chapter-title> In <person-group person-group-type="editor"><string-name><given-names>G.</given-names> <surname>Orr</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Muller</surname></string-name></person-group>. <source>Neural Networks: Tricks of the Trade</source>. <publisher-name>Springer</publisher-name>. <comment>LNCS State-of-the-Art Survey</comment>. <comment>1153, 1158, 1175</comment></mixed-citation></ref>
<ref id="ref-166"><label>166.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Neuneier</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Zimmermann</surname>, <given-names>H.</given-names></string-name></person-group> (<year>1998</year>). <chapter-title>How to Train Neural Networks</chapter-title> In <person-group person-group-type="editor"><string-name><given-names>G.</given-names> <surname>Orr</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Muller</surname></string-name></person-group>. <source>Neural Networks: Tricks of the Trade</source>. <publisher-name>Springer</publisher-name>. <comment>LNCS State-of-the-Art Survey</comment>. <comment>1153, 1175</comment></mixed-citation></ref>
<ref id="ref-167"><label>167.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Robbins</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Monro</surname>, <given-names>S.</given-names></string-name></person-group> (<year>1951a</year>). <article-title>A stochastic approximation method</article-title>. <source>Annals of Mathematical Statistics</source>, <volume>22</volume>(<issue>3</issue>), <fpage>400</fpage>&#x2013;<lpage>407</lpage>. <comment>1153</comment></mixed-citation></ref>
<ref id="ref-168"><label>168.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Aitchison</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods</article-title>. (<italic>Jul 31</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1807.07540v4">arXiv:1807.07540v4</ext-link>. <comment>1155, 1175, 1185, 1186</comment></mixed-citation></ref>
<ref id="ref-169"><label>169.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Goudou</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Munier</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2009</year>). <article-title>The gradient and heavy ball with friction dynamical systems: The quasiconvex case</article-title>. <source>Mathematical Programming</source>, <volume>116</volume>(<issue>1-2</issue>), <fpage>173</fpage>&#x2013;<lpage>191</lpage>. <comment>7th French-Latin American Congress in Applied Mathematics, Univ Chile, Santiago, CHILE, JAN, 2005</comment>. <comment>1157, 1159</comment></mixed-citation></ref>
<ref id="ref-170"><label>170.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kingma</surname>, <given-names>D. P.</given-names></string-name>, <string-name><surname>Ba</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2014</year>). <article-title>Adam: A method for stochastic optimization</article-title>. (<italic>Dec 22</italic>). <comment>Version 1, 2014.12.22</comment>: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1412.6980v1">arXiv:1412.6980v1</ext-link>. <comment>Version 9, 2017.01.30</comment>: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1412.6980v9">arXiv:1412.6980v9</ext-link>. <comment>1158, 1170, 1173, 1174, 1178, 1179, 1180, 1181</comment></mixed-citation></ref>
<ref id="ref-171"><label>171.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Bertsekas</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Tsitsiklis</surname>, <given-names>J.</given-names></string-name></person-group> (<year>1995</year>). <source>Neuro-Dynamic Programming</source>. <publisher-name>Athena Scientific</publisher-name>. <comment>1158, 1159</comment></mixed-citation></ref>
<ref id="ref-172"><label>172.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2012</year>). <chapter-title>A Practical Guide to Training Restricted Boltzmann Machines</chapter-title> In <person-group person-group-type="editor"><string-name><given-names>G.</given-names> <surname>Montavon</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Orr</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Muller</surname></string-name></person-group>. <source>Neural Networks: Tricks of the Trade</source>. <publisher-name>Springer</publisher-name>. <comment>LNCS State-of-the-Art Survey</comment>. <comment>1158</comment></mixed-citation></ref>
<ref id="ref-173"><label>173.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Incerti</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Parisi</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Zirilli</surname>, <given-names>F.</given-names></string-name></person-group> (<year>1979</year>). <article-title>New method for solving non-linear simultaneous equations</article-title>. <source>SIAM Journal on Numerical Analysis</source>, <volume>16</volume>(<issue>5</issue>), <fpage>779</fpage>&#x2013;<lpage>789</lpage>. <comment>1158</comment></mixed-citation></ref>
<ref id="ref-174"><label>174.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Voigt</surname>, <given-names>R.</given-names></string-name></person-group> (<year>1971</year>). <article-title>Rates of convergence for a class of iterative procedures</article-title>. <source>SIAM Journal on Numerical Analysis</source>, <volume>8</volume>(<issue>1</issue>), <fpage>127</fpage>&#x2013;<lpage>134</lpage>. <comment>1158</comment></mixed-citation></ref>
<ref id="ref-175"><label>175.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Plaut</surname>, <given-names>D. C.</given-names></string-name>, <string-name><surname>Nowlan</surname>, <given-names>S. J.</given-names></string-name>, <string-name><surname>Hinton</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>1986</year>). <article-title>Experiments on learning by back propagation</article-title>. <comment>Technical Report CMU-CS-86-126, June</comment>. <ext-link ext-link-type="uri" xlink:href="https://www.semanticscholar.org/paper/Experiments-on-Learning-by-Back-Propagation.-Plaut-Nowlan/4a42b2104ca8ff891ae77c40a915d4c94c8f8428">Website</ext-link>. <comment>1158, 1167</comment></mixed-citation></ref>
<ref id="ref-176"><label>176.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jacobs</surname>, <given-names>R.</given-names></string-name></person-group> (<year>1988</year>). <article-title>Increased rates of convergence through learning rate adaptation</article-title>. <source>Neural Networks</source>, <volume>1</volume>(<issue>4</issue>), <fpage>295</fpage>&#x2013;<lpage>307</lpage>. <comment>1158</comment></mixed-citation></ref>
<ref id="ref-177"><label>177.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Hagiwara</surname>, <given-names>M.</given-names></string-name></person-group> (<year>1992</year>). <article-title>Theoretical derivation of momentum term in back-propagation</article-title>. In <source>Proceedings of the International Joint Conference on Neural Networks (IJCNN&#x2019;92)</source>. <comment>volume 1</comment>. <publisher-loc>Piscataway, NJ</publisher-loc>, <publisher-name>IEEE</publisher-name>. <comment>1158</comment></mixed-citation></ref>
<ref id="ref-178"><label>178.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Gill</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Murray</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Wright</surname>, <given-names>M.</given-names></string-name></person-group> (<year>1981</year>). <source>Practical Optimization</source>. <publisher-name>Academic Press</publisher-name>. <comment>1158</comment></mixed-citation></ref>
<ref id="ref-179"><label>179.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Snyman</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wilke</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2018</year>). <source>Practical Mathematical Optimization: Basic optimization theory and gradient-based algorithms</source>. <publisher-name>Springer</publisher-name>. <comment>1158, 1193</comment></mixed-citation></ref>
<ref id="ref-180"><label>180.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Priddy</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Keller</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2005</year>). <source>Artificial neural networks: An introduction</source>. <publisher-name>SPIE</publisher-name>. <comment>1159</comment></mixed-citation></ref>
<ref id="ref-181"><label>181.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sutskever</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Martens</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Dahl</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2013</year>). <article-title>On the importance of initialization and momentum in deep learning</article-title>. <source>Proceedings of the 30th International Conference on Machine Learning, PMLR</source>, <volume>28</volume>(<issue>3</issue>). <ext-link ext-link-type="uri" xlink:href="http://proceedings.mlr.press/v28/sutskever13.html">Original website</ext-link>. <comment>1159</comment></mixed-citation></ref>
<ref id="ref-182"><label>182.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Reddi</surname>, <given-names>S. J.</given-names></string-name>, <string-name><surname>Kale</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2019</year>). <article-title>On the convergence of Adam and beyond</article-title>. (<italic>Oct 23</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1904.09237">arXiv:1904.09237</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=ryQu7f-RZ">OpenReview</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://medium.com/syncedreview/iclr-2018s-best-papers-variant-adam-spherical-cnns-and-meta-learning-6b48dca83e8b">paper ICLR 2018</ext-link>. <comment>1160, 1170, 1172, 1174, 1178, 1179, 1180, 1181, 1191</comment></mixed-citation></ref>
<ref id="ref-183"><label>183.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Phuong</surname>, <given-names>T. T.</given-names></string-name>, <string-name><surname>Phong</surname>, <given-names>L. T.</given-names></string-name></person-group> (<year>2019</year>). <article-title>On the convergence proof of AMSGrad and a new version</article-title>. (<italic>Oct 31</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1904.03590v4">arXiv:1904.03590v4</ext-link>. <comment>1160, 1180, 1181</comment></mixed-citation></ref>
<ref id="ref-184"><label>184.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Li</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Orabona</surname>, <given-names>F.</given-names></string-name></person-group> (<year>2019</year>). <article-title>On the convergence of stochastic gradient descent with adaptive stepsizes</article-title>. (<italic>Feb 26</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1805.08114v3">arXiv:1805.08114v3</ext-link>. <comment>1161</comment></mixed-citation></ref>
<ref id="ref-185"><label>185.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Gardiner</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2004</year>). <source>Handbook of Stochastic Methods: for Physics, Chemistry and the Natural Sciences</source>. <comment>Synergetics</comment>, <edition>3rd edition</edition>. <publisher-name>Springer</publisher-name>. <comment>1162, 1165, 1166</comment></mixed-citation></ref>
<ref id="ref-186"><label>186.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Smith</surname>, <given-names>S. L.</given-names></string-name>, <string-name><surname>Le</surname>, <given-names>Q. V.</given-names></string-name></person-group> (<year>2018</year>). <article-title>A Bayesian perspective on generalization and stochastic gradient descent</article-title>. (<italic>Feb 2018</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1710.06451v3">arXiv:1710.06451v3</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=BJij4yg0Z">OpenReview</ext-link>. <comment>1163, 1164</comment></mixed-citation></ref>
<ref id="ref-187"><label>187.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Li</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Tai</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>E</surname>, <given-names>W.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Stochastic modified equations and adaptive stochastic gradient algorithms</article-title>. (<italic>Jun 20</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1511.06251v3">arXiv:1511.06251v3</ext-link>. <ext-link ext-link-type="uri" xlink:href="http://proceedings.mlr.press/v70/li17f.html">Proceedings of Machine Learning Research, 70:2101-2110, 2017</ext-link>. <comment>1165</comment></mixed-citation></ref>
<ref id="ref-188"><label>188.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lemons</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Gythiel</surname>, <given-names>A.</given-names></string-name></person-group> (<year>1997</year>). <article-title>Paul Langevin&#x2019;s 1908 paper &#x201C;On the theory of Brownian motion&#x201D;</article-title>. <source>American Journal of Physics</source>, <volume>65</volume>(<issue>11</issue>), <fpage>1079</fpage>&#x2013;<lpage>1081</lpage>. <comment>1166, 1167</comment></mixed-citation></ref>
<ref id="ref-189"><label>189.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Coffey</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Kalmykov</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Waldron</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2004</year>). <source>The Langevin Equation</source>. <edition>2nd edition</edition>. <publisher-name>World Scientific</publisher-name>. <comment>1166</comment></mixed-citation></ref>
<ref id="ref-190"><label>190.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Lones</surname>, <given-names>M. A.</given-names></string-name></person-group> (<year>2014</year>). <article-title>Metaheuristics in nature-inspired algorithms</article-title>. In <source>Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation</source>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-191"><label>191.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Yang</surname>, <given-names>X. S.</given-names></string-name></person-group> (<year>2014</year>). <source>Nature-inspired optimization algorithms</source>. <publisher-name>Elsevier</publisher-name>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-192"><label>192.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rere</surname>, <given-names>L. R.</given-names></string-name>, <string-name><surname>Fanany</surname>, <given-names>M. I.</given-names></string-name>, <string-name><surname>Arymurthy</surname>, <given-names>A. M.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Simulated annealing algorithm for deep learning</article-title>. <source>Procedia Computer Science</source>, <volume>72</volume>(<issue>1</issue>), <fpage>137</fpage>&#x2013;<lpage>144</lpage>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-193"><label>193.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rere</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Fanany</surname>, <given-names>M. I.</given-names></string-name>, <string-name><surname>Arymurthy</surname>, <given-names>A. M.</given-names></string-name></person-group> (<year>2016</year>). <article-title>Metaheuristic algorithms for convolution neural network</article-title>. <source>Computational Intelligence and Neuroscience</source>, <volume>2016</volume>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-194"><label>194.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Fong</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Deb</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>X.</given-names></string-name></person-group> (<year>2018</year>). <article-title>How meta-heuristic algorithms contribute to deep learning in the hype of big data analytics</article-title>. In <source>Progress in Intelligent Computing Techniques: Theory, Practice, and Applications</source>. <publisher-name>Springer</publisher-name>, <fpage>3</fpage>&#x2013;<lpage>25</lpage>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-195"><label>195.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Bozorg-Haddad</surname>, <given-names>O.</given-names></string-name></person-group> (<year>2018</year>). <source>Advanced optimization by nature-inspired algorithms</source>. <publisher-name>Springer</publisher-name>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-196"><label>196.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Al-Obeidat</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Belacel</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Spencer</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Combining machine learning and metaheuristics algorithms for classification method PROAFTN</article-title>. In <source>Enhanced Living Environments</source>. <publisher-name>Springer</publisher-name>, <fpage>53</fpage>&#x2013;<lpage>79</lpage>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-197"><label>197.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bui</surname>, <given-names>Q. T.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Metaheuristic algorithms in optimizing neural network: A comparative study for forest fire susceptibility mapping in Dak Nong, Vietnam</article-title>. <source>Geomatics, Natural Hazards and Risk</source>, <volume>10</volume>(<issue>1</issue>), <fpage>136</fpage>&#x2013;<lpage>150</lpage>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-198"><label>198.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Devikanniga</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Vetrivel</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Badrinath</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Review of meta-heuristic optimization based artificial neural networks and its applications</article-title>. In <source>Journal of Physics: Conference Series</source>. volume 1362. <publisher-name>IOP Publishing</publisher-name>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-199"><label>199.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mirjalili</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Dong</surname>, <given-names>J. S.</given-names></string-name>, <string-name><surname>Lewis</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2020</year>). <source>Nature-Inspired Optimizers</source>. <publisher-name>Springer</publisher-name>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-200"><label>200.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Smith</surname>, <given-names>L. N.</given-names></string-name>, <string-name><surname>Topin</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Super-convergence: Very fast training of residual networks using large learning rates</article-title>. (<italic>May 2018</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1708.07120v3">arXiv:1708.07120v3</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=H1A5ztj3b">OpenReview</ext-link>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-201"><label>201.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>R&#x00F6;gnvaldsson</surname>, <given-names>T. S.</given-names></string-name></person-group> (<year>1998</year>). <chapter-title>A Simple Trick for Estimating the Weight Decay Parameter</chapter-title>. In <person-group person-group-type="editor"><string-name><given-names>G.</given-names> <surname>Orr</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Muller</surname></string-name></person-group>. <source>Neural Networks: Tricks of the Trade</source>. <publisher-name>Springer</publisher-name>. <comment>LNCS State-of-the-Art Survey</comment>. <comment>1167</comment></mixed-citation></ref>
<ref id="ref-202"><label>202.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Glorot</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2010</year>). <article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>. In <source>Proceedings of the thirteenth international conference on artificial intelligence and statistics</source>. <conf-name>JMLR Workshop and Conference Proceedings</conf-name>. <comment>1168</comment></mixed-citation></ref>
<ref id="ref-203"><label>203.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Bock</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Goppold</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Weiss</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2018</year>). <article-title>An improvement of the convergence proof of the ADAM-Optimizer</article-title>. (<italic>Apr 27</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1804.10587v1">arXiv:1804.10587v1</ext-link>. <comment>1170, 1180, 1181</comment></mixed-citation></ref>
<ref id="ref-204"><label>204.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Huang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Dong</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate</article-title>. (<italic>Feb 23</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1805.07557v2">arXiv:1805.07557v2</ext-link>. <comment>1170, 1181</comment></mixed-citation></ref>
<ref id="ref-205"><label>205.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Hong</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2019</year>). <article-title>On the convergence of a class of Adam-type algorithms for non-convex optimization</article-title>. (<italic>Mar 10</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1808.02941v2">arXiv:1808.02941v2</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=H1x-x309tm">OpenReview</ext-link>. <comment>1174</comment></mixed-citation></ref>
<ref id="ref-206"><label>206.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Hyndman</surname>, <given-names>R. J.</given-names></string-name>, <string-name><surname>Koehler</surname>, <given-names>A. B.</given-names></string-name>, <string-name><surname>Ord</surname>, <given-names>J. K.</given-names></string-name>, <string-name><surname>Snyder</surname>, <given-names>R. D.</given-names></string-name></person-group> (<year>2008</year>). <source>Forecasting with Exponential Smoothing: The state space approach</source>. <publisher-name>Springer</publisher-name>. <comment>1174</comment></mixed-citation></ref>
<ref id="ref-207"><label>207.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Hyndman</surname>, <given-names>R. J.</given-names></string-name>, <string-name><surname>Athanasopoulos</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2018</year>). <source>Forecasting: Principles and Practice</source>. <edition>2nd edition</edition>. <publisher-name>OTexts</publisher-name>: <publisher-loc>Melbourne, Australia</publisher-loc>. <ext-link ext-link-type="uri" xlink:href="https://otexts.com/fpp2/">Original website</ext-link>, <comment>open online text</comment>. <comment>1175, 1176</comment></mixed-citation></ref>
<ref id="ref-208"><label>208.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dreiseitl</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Ohno-Machado</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2002</year>). <article-title>Logistic regression and artificial neural network classification models: a methodology review</article-title>. <source>Journal of Biomedical Informatics</source>, <volume>35</volume>, <fpage>352</fpage>&#x2013;<lpage>359</lpage>. <comment>1180</comment>; <pub-id pub-id-type="pmid">12968784</pub-id></mixed-citation></ref>
<ref id="ref-209"><label>209.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Gugger</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Howard</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2018</year>). <article-title>AdamW and Super-convergence is now the fastest way to train neural nets</article-title>. <source>Fast.AI</source>, (<italic>Jul 02</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.fast.ai/2018/07/02/adam-weight-decay/">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20191206104130/https://www.fast.ai/2018/07/02/adam-weight-decay/">Internet Archive</ext-link>. <comment>1181, 1185</comment></mixed-citation></ref>
<ref id="ref-210"><label>210.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Xing</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Arpit</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Tsirigotis</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2018</year>). <article-title>A walk with SGD</article-title>. (<italic>May 2018</italic>). <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/arXiv:1802.08770v4">arXiv:1802.08770v4</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=B1l6e3RcF7">OpenReview</ext-link>. <comment>1182</comment></mixed-citation></ref>
<ref id="ref-211"><label>211.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Prokhorov</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2001</year>). <article-title>IJCNN 2001 neural network competition. Slide presentation in IJCNN&#x2019;01</article-title>, <comment>Ford Research Laboratory, 2001</comment> <ext-link ext-link-type="uri" xlink:href="http://web.archive.org/web/20191216172229/http://www.geocities.ws/ijcnn/nnc_ijcnn01.pdf">Internet Archive</ext-link>. <comment>1191</comment></mixed-citation></ref>
<ref id="ref-212"><label>212.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chang</surname>, <given-names>C. C.</given-names></string-name>, <string-name><surname>Lin</surname>, <given-names>C. J.</given-names></string-name></person-group> (<year>2011</year>). <article-title>LIBSVM: A Library for Support Vector Machines</article-title>. <source>ACM Transactions on Intelligent Systems and Technology</source>, <volume>2</volume>(<issue>3</issue>). <comment>Article 27, April 2011</comment>. <ext-link ext-link-type="uri" xlink:href="https://www.csie.ntu.edu.tw/~cjlin/libsvm/">Original website for software</ext-link> <comment>(Version 3.24 released 2019.09.11)</comment>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20190926060633/https://www.csie.ntu.edu.tw/~cjlin/libsvm/">Internet Archive</ext-link>. <comment>1191</comment></mixed-citation></ref>
<ref id="ref-213"><label>213.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Brogan</surname>, <given-names>W. L.</given-names></string-name></person-group> (<year>1990</year>). <source>Modern Control Theory</source>. <edition>3rd edition</edition>. <publisher-name>Pearson</publisher-name>. <comment>1193</comment></mixed-citation></ref>
<ref id="ref-214"><label>214.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hopfield</surname>, <given-names>J. J.</given-names></string-name></person-group> (<year>1984</year>). <article-title>Neurons with graded response have collective computational properties like those of two-state neurons</article-title>. <source>Proceedings of the National Academy of Sciences</source>, <volume>81</volume>(<issue>10</issue>), <fpage>3088</fpage>&#x2013;<lpage>3092</lpage>. <ext-link ext-link-type="uri" xlink:href="http://www.pnas.org/cgi/doi/10.1073/pnas.81.10.3088">Original website</ext-link>. <comment>1193</comment></mixed-citation></ref>
<ref id="ref-215"><label>215.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Pineda</surname>, <given-names>F. J.</given-names></string-name></person-group> (<year>1987</year>). <article-title>Generalization of back-propagation to recurrent neural networks</article-title>. <source>Physical Review Letters</source>, <volume>59</volume>(<issue>19</issue>), <fpage>2229</fpage>&#x2013;<lpage>2232</lpage>. <comment>1193</comment>; <pub-id pub-id-type="pmid">10035458</pub-id></mixed-citation></ref>
<ref id="ref-216"><label>216.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Newmark</surname>, <given-names>N. M.</given-names></string-name></person-group> (<year>1959</year>). <source>A Method of Computation for Structural Dynamics</source>. <comment>Number 85 in A Method of Computation for Structural Dynamics</comment>. <publisher-name>American Society of Civil Engineers</publisher-name>. <comment>1194</comment></mixed-citation></ref>
<ref id="ref-217"><label>217.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hilber</surname>, <given-names>H. M.</given-names></string-name>, <string-name><surname>Hughes</surname>, <given-names>T. J.</given-names></string-name>, <string-name><surname>Taylor</surname>, <given-names>R. L.</given-names></string-name></person-group> (<year>1977</year>). <article-title>Improved numerical dissipation for time integration algorithms in structural dynamics</article-title>. <source>Earthquake Engineering &amp; Structural Dynamics</source>, <volume>5</volume>(<issue>3</issue>), <fpage>283</fpage>&#x2013;<lpage>292</lpage>. <ext-link ext-link-type="uri" xlink:href="http://doi.wiley.com/10.1002/eqe.4290050306">Original website</ext-link>. <comment>1194</comment></mixed-citation></ref>
<ref id="ref-218"><label>218.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chung</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hulbert</surname>, <given-names>G. M.</given-names></string-name></person-group> (<year>1993</year>). <article-title>A Time Integration Algorithm for Structural Dynamics With Improved Numerical Dissipation: The Generalized-&#x03B1; Method</article-title>. <source>Journal of Applied Mechanics</source>, <volume>60</volume>(<issue>2</issue>), <fpage>371</fpage>. <ext-link ext-link-type="uri" xlink:href="https://appliedmechanics.asmedigitalcollection.asme.org/article.aspx?articleid=1410995">Original website</ext-link>. <comment>1194</comment></mixed-citation></ref>
<ref id="ref-219"><label>219.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Olah</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Understanding LSTM Networks</article-title>. <source>colah&#x2019;s blog</source>, (<italic>Aug 27</italic>). <ext-link ext-link-type="uri" xlink:href="https://colah.github.io/posts/2015-08-Understanding-LSTMs">Original website</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20190130114312/https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Internet Archive</ext-link>. <comment>1199</comment></mixed-citation></ref>
<ref id="ref-220"><label>220.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Cho</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>van Merri&#x00EB;nboer</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Bahdanau</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2014</year>). <article-title>On the properties of neural machine translation: Encoder-decoder approaches</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1409.1259">arXiv:1409.1259</ext-link>. <comment>1202</comment></mixed-citation></ref>
<ref id="ref-221"><label>221.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chung</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Gulcehre</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Cho</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2014</year>). <article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1412.3555">arXiv:1412.3555</ext-link>. <comment>1202, 1203</comment></mixed-citation></ref>
<ref id="ref-222"><label>222.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kim</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Denton</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Hoang</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Rush</surname>, <given-names>A. M.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Structured attention networks</article-title>. <conf-name>International Conference on Learning Representations</conf-name>, OpenReview.net, <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1702.00887">arXiv:1702.00887</ext-link>. <comment>1203</comment></mixed-citation></ref>
<ref id="ref-223"><label>223.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Cho</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>van Merri&#x00EB;nboer</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Bahdanau</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name></person-group> <publisher-loc>Doha, Qatar</publisher-loc>. <comment>1204</comment></mixed-citation></ref>
<ref id="ref-224"><label>224.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Schuster</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Paliwal</surname>, <given-names>K. K.</given-names></string-name></person-group> (<year>1997</year>). <article-title>Bidirectional recurrent neural networks</article-title>. <source>IEEE Transactions on Signal Processing</source>, <volume>45</volume>(<issue>11</issue>), <fpage>2673</fpage>&#x2013;<lpage>2681</lpage>. <comment>1205</comment></mixed-citation></ref>
<ref id="ref-225"><label>225.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ba</surname>, <given-names>J. L.</given-names></string-name>, <string-name><surname>Kiros</surname>, <given-names>J. R.</given-names></string-name>, <string-name><surname>Hinton</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>2016</year>). <article-title>Layer normalization</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1607.06450">arXiv:1607.06450</ext-link>. <comment>1209</comment></mixed-citation></ref>
<ref id="ref-226"><label>226.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Brown</surname>, <given-names>T. B.</given-names></string-name>, <string-name><surname>Mann</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Ryder</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Subbiah</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Kaplan</surname>, <given-names>J.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>Language models are few-shot learners</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2005.14165v4">arXiv:2005.14165v4</ext-link>. <comment>1211</comment></mixed-citation></ref>
<ref id="ref-227"><label>227.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tsai</surname>, <given-names>Y. H. H.</given-names></string-name>, <string-name><surname>Bai</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Yamada</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Morency</surname>, <given-names>L. P.</given-names></string-name>, <string-name><surname>Salakhutdinov</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Transformer dissection: A unified understanding of transformer&#x2019;s attention via the lens of kernel</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1908.11775">arXiv:1908.11775</ext-link>. <comment>1211</comment></mixed-citation></ref>
<ref id="ref-228"><label>228.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rodriguez-Torrado</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Ruiz</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Cueto-Felgueroso</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Green</surname>, <given-names>M. C.</given-names></string-name>, <string-name><surname>Friesen</surname>, <given-names>T.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2022</year>). <article-title>Physics-informed attention-based neural network for hyperbolic partial differential equations: application to the Buckley&#x2013;Leverett problem</article-title>. <source>Scientific Reports</source>, <volume>12</volume>(<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>12</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.nature.com/articles/s41598-022-11058-2">Original website</ext-link>. <comment>1211, 1230</comment></mixed-citation></ref>
<ref id="ref-229"><label>229.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bahri</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Towards an Understanding of Wide, Deep Neural Networks</article-title>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=V8iZOpY28_E&#x0026;t=223s">Youtube</ext-link>. <comment>1211</comment></mixed-citation></ref>
<ref id="ref-230"><label>230.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Ananthaswamy</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2021</year>). <article-title>A New Link to an Old Model Could Crack the Mystery of Deep Learning</article-title>. <source>Quanta Magazine</source>, (<italic>Oct 11</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.quantamagazine.org/a-new-link-to-an-old-model-could-crack-the-mystery-of-deep-learning-20211011/">Original website</ext-link>. <comment>1211</comment></mixed-citation></ref>
<ref id="ref-231"><label>231.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lee</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Bahri</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Novak</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Schoenholz</surname>, <given-names>S. S.</given-names></string-name>, <string-name><surname>Pennington</surname>, <given-names>J.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2018</year>). <article-title>Deep neural networks as Gaussian processes</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1711.00165">arXiv:1711.00165</ext-link>. <comment>1211, 1305</comment></mixed-citation></ref>
<ref id="ref-232"><label>232.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Jacot</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gabriel</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Hongler</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Neural tangent kernel: Convergence and generalization in neural networks</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1806.07572">arXiv:1806.07572</ext-link>. <comment>1211, 1212, 1230</comment></mixed-citation></ref>
<ref id="ref-233"><label>233.</label><mixed-citation publication-type="web"><article-title>2021&#x2019;s Biggest Breakthroughs in Math and Computer Science</article-title>. <publisher-name>Quanta Magazine</publisher-name>, <year>2021</year> <month>Dec</month> <day>31</day>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=9uASADiYe_8">Youtube</ext-link>. <comment>1211, 1304</comment></mixed-citation></ref>
<ref id="ref-234"><label>234.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Rasmussen</surname>, <given-names>C. E.</given-names></string-name>, <string-name><surname>Williams</surname>, <given-names>C. K.</given-names></string-name></person-group> (<year>2006</year>). <source>Gaussian processes for machine learning</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>. <ext-link ext-link-type="uri" xlink:href="https://direct.mit.edu/books/book/2320/Gaussian-Processes-for-Machine-Learning">MIT website</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://gaussianprocess.org/gpml/chapters/RW.pdf">GaussianProcess.org</ext-link>. <comment>1211, 1215, 1216, 1217, 1219, 1339</comment></mixed-citation></ref>
<ref id="ref-235"><label>235.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Belkin</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Mandal</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2018</year>). <article-title>To understand deep learning we need to understand kernel learning</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1802.01396">arXiv:1802.01396</ext-link>. <comment>1212, 1215</comment></mixed-citation></ref>
<ref id="ref-236"><label>236.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lee</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Schoenholz</surname>, <given-names>S. S.</given-names></string-name>, <string-name><surname>Pennington</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Adlam</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Xiao</surname>, <given-names>L.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>Finite versus infinite neural networks: an empirical study</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2007.15801">arXiv:2007.15801</ext-link>. <comment>1212</comment></mixed-citation></ref>
<ref id="ref-237"><label>237.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Aronszajn</surname>, <given-names>N.</given-names></string-name></person-group> (<year>1950</year>). <article-title>Theory of reproducing kernels</article-title>. <source>Transactions of the American Mathematical Society</source>, <volume>68</volume>(<issue>3</issue>), <fpage>337</fpage>&#x2013;<lpage>404</lpage>. <comment>1212, 1215</comment></mixed-citation></ref>
<ref id="ref-238"><label>238.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Hastie</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Friedman</surname>, <given-names>J. H.</given-names></string-name></person-group> (<year>2017</year>). <source>The elements of statistical learning: Data mining, inference, and prediction</source>. <edition>2nd edition</edition>. <publisher-name>Springer</publisher-name>. <comment>Corrected, 12th printing, Jan 13</comment>. <comment>1213, 1214, 1215</comment></mixed-citation></ref>
<ref id="ref-239"><label>239.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Evgeniou</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Pontil</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Poggio</surname>, <given-names>T.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Regularization networks and support vector machines</article-title>. <source>Advances in Computational Mathematics</source>, <volume>13</volume>(<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>50</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.semanticscholar.org/paper/Regularization-Networks-and-Support-Vector-Machines-Evgeniou-Pontil/d2d13bc44e15fd93480e16305d37c025bc0818c2">Semantic Scholar</ext-link>. <comment>1213, 1214, 1215</comment></mixed-citation></ref>
<ref id="ref-240"><label>240.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Berlinet</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Thomas-Agnan</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2004</year>). <source>Reproducing kernel Hilbert spaces in probability and statistics</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer Science &#x0026; Business Media</publisher-name>. <comment>1214, 1215, 1216</comment></mixed-citation></ref>
<ref id="ref-241"><label>241.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Girosi</surname>, <given-names>F.</given-names></string-name></person-group> (<year>1998</year>). <article-title>An equivalence between sparse approximation and support vector machines</article-title>. <source>Neural Computation</source>, <volume>10</volume>(<issue>6</issue>), <fpage>1455</fpage>&#x2013;<lpage>1480</lpage>. <ext-link ext-link-type="uri" xlink:href="https://direct.mit.edu/neco/article/10/6/1455/6181/An-Equivalence-Between-Sparse-Approximation-and">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://www.semanticscholar.org/paper/An-Equivalence-Between-Sparse-Approximation-and-Girosi/d27c7569fdbcbb57ff511f5293e32b547acca7b3">Semantic Scholar</ext-link>. <comment>1214, 1215</comment>; <pub-id pub-id-type="pmid">9698353</pub-id></mixed-citation></ref>
<ref id="ref-242"><label>242.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Wahba</surname>, <given-names>G.</given-names></string-name></person-group> (<year>1990</year>). <source>Spline Models for Observational Data</source>. <publisher-loc>Philadelphia, Pennsylvania</publisher-loc>: <publisher-name>SIAM</publisher-name>. <comment>4th printing 2002</comment>. <comment>1215</comment></mixed-citation></ref>
<ref id="ref-243"><label>243.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Adler</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Hilbert spaces and the Riesz Representation Theorem</article-title>. <comment>The University of Chicago Mathematics REU 2021</comment>, <ext-link ext-link-type="uri" xlink:href="http://math.uchicago.edu/~may/REU2021/REUPapers/Adler.pdf">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220120165948/http://math.uchicago.edu/~may/REU2021/REUPapers/Adler.pdf">Internet archive</ext-link>. <comment>1215</comment></mixed-citation></ref>
<ref id="ref-244"><label>244.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Schaback</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Wendland</surname>, <given-names>H.</given-names></string-name></person-group> (<year>2006</year>). <article-title>Kernel techniques: From machine learning to meshless methods</article-title>. <source>Acta Numerica</source>, <volume>15</volume>, <fpage>543</fpage>&#x2013;<lpage>639</lpage>. <comment>1215</comment></mixed-citation></ref>
<ref id="ref-245"><label>245.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Yaida</surname>, <given-names>S.</given-names></string-name></person-group> <article-title>Non-Gaussian processes and neural networks at finite widths</article-title>. In <source>Mathematical and Scientific Machine Learning</source>. <comment>1216</comment></mixed-citation></ref>
<ref id="ref-246"><label>246.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Sendera</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Tabor</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Nowak</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Bedychaj</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Patacchiola</surname>, <given-names>M.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Non-Gaussian Gaussian processes for few-shot regression</article-title>. <source>Advances in Neural Information Processing Systems</source>, <volume>34</volume>, <fpage>10285</fpage>&#x2013;<lpage>10298</lpage>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2110.13561">arXiv:2110.13561</ext-link>. <comment>1216</comment></mixed-citation></ref>
<ref id="ref-247"><label>247.</label><mixed-citation publication-type="thesis"><person-group person-group-type="author"><string-name><surname>Duvenaud</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2014</year>). <article-title>Automatic model construction with Gaussian processes</article-title>. <comment>Ph.D. thesis, University of Cambridge</comment>. <ext-link ext-link-type="uri" xlink:href="https://www.repository.cam.ac.uk/handle/1810/247281">Thesis repository</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by-sa/2.0/uk/">CC BY-SA 2.0 UK</ext-link>. <comment>1219, 1220</comment></mixed-citation></ref>
<ref id="ref-248"><label>248.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>von Mises</surname>, <given-names>R.</given-names></string-name></person-group> (<year>1964</year>). <source>Mathematical theory of probability and statistics</source>. <publisher-name>Elsevier</publisher-name>. <ext-link ext-link-type="uri" xlink:href="https://www.sciencedirect.com/book/9781483232133/mathematical-theory-of-probability-and-statistics">Book site</ext-link>. <comment>1219, 1339</comment></mixed-citation></ref>
<ref id="ref-249"><label>249.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Hale</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Deep Learning Framework Power Scores 2018</article-title>. <source>Towards Data Science</source>, (<italic>Sep 19</italic>). <ext-link ext-link-type="uri" xlink:href="https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a">Original website</ext-link>. <comment>Internet archive</comment>. <comment>1220, 1222</comment></mixed-citation></ref>
<ref id="ref-250"><label>250.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Abadi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Agarwal</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Barham</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Brevdo</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>Z.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2015</year>). <article-title>TensorFlow: Large-scale machine learning on heterogeneous systems</article-title>. <ext-link ext-link-type="uri" xlink:href="http://download.tensorflow.org/paper/whitepaper2015.pdf">Whitepaper pdf</ext-link>, <comment>Software available from</comment> <ext-link ext-link-type="uri" xlink:href="https://www.tensorflow.org/">tensorflow.org</ext-link>. <comment>1222</comment></mixed-citation></ref>
<ref id="ref-251"><label>251.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Jouppi</surname>, <given-names>N.</given-names></string-name></person-group> <article-title>Google supercharges machine learning tasks with TPU custom chip</article-title>. <ext-link ext-link-type="uri" xlink:href="https://cloud.google.com/blog/products/ai-machine-learning/google-supercharges-machine-learning-tasks-with-custom-chip">Original website</ext-link>. <comment>1222</comment></mixed-citation></ref>
<ref id="ref-252"><label>252.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Chollet</surname>, <given-names>F.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2015</year>). <article-title>Keras</article-title>. <ext-link ext-link-type="uri" xlink:href="https://keras.io">Original website</ext-link>. <comment>1223</comment></mixed-citation></ref>
<ref id="ref-253"><label>253.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Paszke</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gross</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Massa</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Lerer</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Bradbury</surname>, <given-names>J.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2019</year>). <chapter-title>PyTorch: An imperative style, high-performance deep learning library</chapter-title>. In <string-name><given-names>H.</given-names> <surname>Wallach</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Larochelle</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Beygelzimer</surname></string-name>, <string-name><given-names>F.</given-names> <surname>d&#x2019;Alch&#x00E9;-Buc</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Fox</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Garnett</surname></string-name>, <comment>editors</comment>, <source>Advances in Neural Information Processing Systems 32</source>. <publisher-name>Curran Associates, Inc.</publisher-name>, <fpage>8024</fpage>&#x2013;<lpage>8035</lpage>. <ext-link ext-link-type="uri" xlink:href="http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf">Paper pdf</ext-link>. <comment>1223</comment></mixed-citation></ref>
<ref id="ref-254"><label>254.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Chintala</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Decisions and pivots on PyTorch</article-title>. <comment>2022 Jan 19</comment>, <ext-link ext-link-type="uri" xlink:href="https://soumith.ch/posts/2022/01/pytorch-retro/">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220913105457/https://soumith.ch/posts/2022/01/pytorch-retro/">Internet archive</ext-link>. <comment>1223</comment></mixed-citation></ref>
<ref id="ref-255"><label>255.</label><mixed-citation publication-type="web"><article-title>PyTorch Turns 5!</article-title> <year>2022</year> <month>Jan</month> <day>20</day>, <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=r7qB7mKJOFk">Youtube</ext-link>. <comment>1223</comment></mixed-citation></ref>
<ref id="ref-256"><label>256.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kaelbling</surname>, <given-names>L. P.</given-names></string-name>, <string-name><surname>Littman</surname>, <given-names>M. L.</given-names></string-name>, <string-name><surname>Moore</surname>, <given-names>A. W.</given-names></string-name></person-group> (<year>1996</year>). <article-title>Reinforcement learning: A survey</article-title>. <source>Journal of Artificial Intelligence Research</source>, <volume>4</volume>, <fpage>237</fpage>&#x2013;<lpage>285</lpage>. <comment>1223</comment></mixed-citation></ref>
<ref id="ref-257"><label>257.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Arulkumaran</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Deisenroth</surname>, <given-names>M. P.</given-names></string-name>, <string-name><surname>Brundage</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Bharath</surname>, <given-names>A. A.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Deep reinforcement learning: A brief survey</article-title>. <source>IEEE Signal Processing Magazine</source>, <volume>34</volume>, <fpage>26</fpage>&#x2013;<lpage>38</lpage>. <comment>1223</comment></mixed-citation></ref>
<ref id="ref-258"><label>258.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>S&#x00FC;nderhauf</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Brock</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Scheirer</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Hadsell</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Fox</surname>, <given-names>D.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2018</year>). <article-title>The limits and potentials of deep learning for robotics</article-title>. <source>The International Journal of Robotics Research</source>, <volume>37</volume>, <fpage>405</fpage>&#x2013;<lpage>420</lpage>. <comment>1224</comment></mixed-citation></ref>
<ref id="ref-259"><label>259.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Simo</surname>, <given-names>J. C.</given-names></string-name>, <string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name></person-group> (<year>1988</year>). <article-title>On the dynamics in space of rods undergoing large motions&#x2013;a geometrically exact approach</article-title>. <source>Computer Methods in Applied Mechanics and Engineering</source>, <volume>66</volume>, <fpage>125</fpage>&#x2013;<lpage>161</lpage>. <comment>1224</comment></mixed-citation></ref>
<ref id="ref-260"><label>260.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Humer</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2013</year>). <article-title>Dynamic modeling of beams with non-material, deformation-dependent boundary conditions</article-title>. <source>Journal of Sound and Vibration</source>, <volume>332</volume>(<issue>3</issue>), <fpage>622</fpage>&#x2013;<lpage>641</lpage>. <comment>1224</comment></mixed-citation></ref>
<ref id="ref-261"><label>261.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Steinbrecher</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Humer</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2017</year>). <article-title>On the numerical modeling of sliding beams: A comparison of different approaches</article-title>. <source>Journal of Sound and Vibration</source>, <volume>408</volume>, <fpage>270</fpage>&#x2013;<lpage>290</lpage>. <comment>1224</comment></mixed-citation></ref>
<ref id="ref-262"><label>262.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Humer</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Steinbrecher</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2020</year>). <article-title>General sliding-beam formulation: A non-material description for analysis of sliding structures and axially moving beams</article-title>. <source>Journal of Sound and Vibration</source>, <volume>480</volume>, <fpage>115341</fpage>. <ext-link ext-link-type="uri" xlink:href="https://linkinghub.elsevier.com/retrieve/pii/S0022460X20301723">Original website</ext-link>. <comment>1224</comment></mixed-citation></ref>
<ref id="ref-263"><label>263.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Bradbury</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Frostig</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Hawkins</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Johnson</surname>, <given-names>M. J.</given-names></string-name>, <string-name><surname>Leary</surname>, <given-names>C.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2018</year>). <article-title>JAX: composable transformations of Python+NumPy programs</article-title>. <ext-link ext-link-type="uri" xlink:href="http://github.com/google/jax">Original website</ext-link>. <comment>1224</comment></mixed-citation></ref>
<ref id="ref-264"><label>264.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Heek</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Levskaya</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Oliver</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ritter</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Rondepierre</surname>, <given-names>B.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>Flax: A neural network library and ecosystem for JAX</article-title>. <ext-link ext-link-type="uri" xlink:href="http://github.com/google/flax">Original website</ext-link>. <comment>1224</comment></mixed-citation></ref>
<ref id="ref-265"><label>265.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Schoeberl</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2014</year>). <article-title>C++11 Implementation of Finite Elements in NGSolve</article-title>. <ext-link ext-link-type="uri" xlink:href="https://www.asc.tuwien.ac.at/~schoeberl/wiki/publications/ngs-cpp11.pdf">Scientific report</ext-link>. <comment>1225, 1226</comment></mixed-citation></ref>
<ref id="ref-266"><label>266.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Weitzhofer</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Humer</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Machine-Learning Frameworks in Scientific Computing: Finite Element Analysis and Multibody Simulation</article-title>. <ext-link ext-link-type="uri" xlink:href="https://gitlab.com/alexander.humer/cmes-dl-review/-/blob/main/asme_idetc_msndc_2021/presentation_idetc.pdf">Talk slides</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://gitlab.com/alexander.humer/cmes-dl-review/-/blob/main/asme_idetc_msndc_2021/humer_idetc.mp4">Video talk</ext-link>. <comment>1225</comment></mixed-citation></ref>
<ref id="ref-267"><label>267.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lavin</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Zenil</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Paige</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Krakauer</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Gottschlich</surname>, <given-names>J.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Simulation Intelligence: Towards a New Generation of Scientific Methods</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2112.03235">arXiv:2112.03235</ext-link>. <comment>1225</comment></mixed-citation></ref>
<ref id="ref-268"><label>268.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Cai</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Mao</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Yin</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Karniadakis</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Physics-informed neural networks (PINNs) for fluid mechanics: A review</article-title>. <source>Acta Mechanica Sinica</source>, <volume>37</volume>(<issue>12</issue>), <fpage>1727</fpage>&#x2013;<lpage>1738</lpage>. <ext-link ext-link-type="uri" xlink:href="https://link.springer.com/article/10.1007/s10409-021-01148-1">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2105.09506">arXiv:2105.09506</ext-link>. <comment>1225, 1226, 1227</comment></mixed-citation></ref>
<ref id="ref-269"><label>269.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Cuomo</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>di Cola</surname>, <given-names>V. S.</given-names></string-name>, <string-name><surname>Giampaolo</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Rozza</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Raissi</surname>, <given-names>M.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2022</year>). <article-title>Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and What&#x2019;s next</article-title>. <source>Journal of Scientific Computing</source>, <volume>92</volume>(<issue>3</issue>). <comment>Article No. 88</comment>, <ext-link ext-link-type="uri" xlink:href="https://link.springer.com/article/10.1007/s10915-022-01939-z">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2201.05624">arXiv:2201.05624</ext-link>. <comment>1226, 1227, 1228</comment></mixed-citation></ref>
<ref id="ref-270"><label>270.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Karniadakis</surname>, <given-names>G. E.</given-names></string-name>, <string-name><surname>Kevrekidis</surname>, <given-names>I. G.</given-names></string-name>, <string-name><surname>Lu</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Perdikaris</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Physics-informed machine learning</article-title>. <source>Nature Reviews Physics</source>, <volume>3</volume>(<issue>6</issue>), <fpage>422</fpage>&#x2013;<lpage>440</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.nature.com/articles/s42254-021-00314-5">Original website</ext-link>. <comment>1226, 1227, 1228</comment></mixed-citation></ref>
<ref id="ref-271"><label>271.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lu</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Meng</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Mao</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Karniadakis</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>2021</year>). <article-title>DeepXDE: A deep learning library for solving differential equations</article-title>. <source>SIAM Review</source>, <volume>63</volume>(<issue>1</issue>), <fpage>208</fpage>&#x2013;<lpage>228</lpage>. <ext-link ext-link-type="uri" xlink:href="https://epubs.siam.org/doi/10.1137/19M1274067">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://epubs.siam.org/doi/pdf/10.1137/19M1274067">pdf</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1907.04502">arXiv:1907.04502</ext-link>. <comment>1226, 1228</comment></mixed-citation></ref>
<ref id="ref-272"><label>272.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Hennigh</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Narasimhan</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Nabian</surname>, <given-names>M. A.</given-names></string-name>, <string-name><surname>Subramaniam</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Tangsali</surname>, <given-names>K.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>NVIDIA SimNet&#x2122;: an AI-accelerated multi-physics simulation framework</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2012.07938">arXiv:2012.07938</ext-link>. <comment>The software name &#x201C;SimNet&#x201D; has been changed to &#x201C;Modulus&#x201D;; see</comment> <ext-link ext-link-type="uri" xlink:href="https://developer.nvidia.com/modulus">NVIDIA Modulus</ext-link>. <comment>1228</comment></mixed-citation></ref>
<ref id="ref-273"><label>273.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Koryagin</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Khudorozkov</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Tsimfer</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2019</year>). <article-title>PyDEns: a Python Framework for Solving Differential Equations with Neural Networks</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1909.11544">arXiv:1909.11544</ext-link>. <comment>1228</comment></mixed-citation></ref>
<ref id="ref-274"><label>274.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Sondak</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Protopapas</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mattheakis</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>S.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>NeuroDiffEq: A Python package for solving differential equations with neural networks</article-title>. <source>Journal of Open Source Software</source>, <volume>5</volume>(<issue>46</issue>), <fpage>1931</fpage>. <ext-link ext-link-type="uri" xlink:href="https://joss.theoj.org/papers/10.21105/joss.01931">Original website</ext-link>. <comment>1227, 1228</comment></mixed-citation></ref>
<ref id="ref-275"><label>275.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rackauckas</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Nie</surname>, <given-names>Q.</given-names></string-name></person-group> (<year>2017</year>). <article-title>DifferentialEquations.jl &#x2013; a performant and feature-rich ecosystem for solving differential equations in Julia</article-title>. <source>Journal of Open Research Software</source>, <volume>5</volume>(<issue>1</issue>). <ext-link ext-link-type="uri" xlink:href="https://openresearchsoftware.metajnl.com/articles/10.5334/jors.151/">Original website</ext-link>. <comment>1227, 1228</comment></mixed-citation></ref>
<ref id="ref-276"><label>276.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Haghighat</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Juanes</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2021</year>). <article-title>SciANN: A Keras/TensorFlow wrapper for scientific computations and physics-informed deep learning using artificial neural networks</article-title>. <source>Computer Methods in Applied Mechanics and Engineering</source>, <volume>373</volume>, <fpage>113552</fpage>. <comment>1228</comment></mixed-citation></ref>
<ref id="ref-277"><label>277.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Xu</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Darve</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2020</year>). <article-title>ADCME: Learning Spatially-varying Physical Fields using Deep Neural Networks</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2011.11955">arXiv:2011.11955</ext-link>. <comment>1228</comment></mixed-citation></ref>
<ref id="ref-278"><label>278.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Gardner</surname>, <given-names>J. R.</given-names></string-name>, <string-name><surname>Pleiss</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Bindel</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Weinberger</surname>, <given-names>K. Q.</given-names></string-name>, <string-name><surname>Wilson</surname>, <given-names>A. G.</given-names></string-name></person-group> (<year>2018</year>). <article-title>GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration</article-title>. <comment>[v6] Tue, 29 Jun 2021</comment> <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1809.11165">arXiv:1809.11165</ext-link>. <comment>1228</comment></mixed-citation></ref>
<ref id="ref-279"><label>279.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Schoenholz</surname>, <given-names>S. S.</given-names></string-name>, <string-name><surname>Novak</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Fast and Easy Infinitely Wide Networks with Neural Tangents</article-title>. <comment>Google AI Blog, 2020 Mar 13</comment>, <ext-link ext-link-type="uri" xlink:href="https://ai.googleblog.com/2020/03/fast-and-easy-infinitely-wide-networks.html">Original website</ext-link>. <comment>1228, 1305</comment></mixed-citation></ref>
<ref id="ref-280"><label>280.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>He</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zheng</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2020</year>). <article-title>ReLU deep neural networks and linear finite elements</article-title>. <source>Journal of Computational Mathematics</source>, <volume>38</volume>(<issue>3</issue>), <fpage>502</fpage>&#x2013;<lpage>527</lpage>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1807.03973">arXiv:1807.03973</ext-link>. <comment>1228, 1230</comment></mixed-citation></ref>
<ref id="ref-281"><label>281.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Arora</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Basu</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Mianjy</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mukherjee</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2016</year>). <article-title>Understanding deep neural networks with rectified linear units</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1611.01491">arXiv:1611.01491</ext-link>. <comment>1228</comment></mixed-citation></ref>
<ref id="ref-282"><label>282.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Raissi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Perdikaris</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Karniadakis</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations</article-title>. <source>Journal of Computational Physics</source>, <volume>378</volume>, <fpage>686</fpage>&#x2013;<lpage>707</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.sciencedirect.com/science/article/pii/S0021999118307125">Original website</ext-link>. <comment>1229, 1230</comment></mixed-citation></ref>
<ref id="ref-283"><label>283.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Kharazmi</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Karniadakis</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Variational physics-informed neural networks for solving partial differential equations</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1912.00873">arXiv:1912.00873</ext-link>. <comment>1229, 1230</comment></mixed-citation></ref>
<ref id="ref-284"><label>284.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Kharazmi</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Karniadakis</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>2021</year>). <article-title>hp-VPINNs: Variational physics-informed neural networks with domain decomposition</article-title>. <source>Computer Methods in Applied Mechanics and Engineering</source>, <volume>374</volume>, <fpage>113547</fpage>. <comment>See also</comment> <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1912.00873">arXiv:1912.00873</ext-link>. <comment>1229, 1230</comment></mixed-citation></ref>
<ref id="ref-285"><label>285.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Berrone</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Canuto</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Pintore</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Variational physics informed neural networks: the role of quadratures and test functions</article-title>. <source>Journal of Scientific Computing</source>, <volume>92</volume>(<issue>3</issue>), <fpage>1</fpage>&#x2013;<lpage>27</lpage>. <ext-link ext-link-type="uri" xlink:href="https://link.springer.com/article/10.1007/s10915-022-01950-4">Original website</ext-link>. <comment>1229</comment></mixed-citation></ref>
<ref id="ref-286"><label>286.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Perdikaris</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2020</year>). <article-title>When and why PINNs fail to train: A neural tangent kernel perspective</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2007.14527">arXiv:2007.14527</ext-link>. <comment>1230</comment></mixed-citation></ref>
<ref id="ref-287"><label>287.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Rohrhofer</surname>, <given-names>F. M.</given-names></string-name>, <string-name><surname>Posch</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>G&#x00F6;ssnitzer</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Geiger</surname>, <given-names>B. C.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Understanding the difficulty of training physics-informed neural networks on dynamical systems</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2203.13648">arXiv:2203.13648</ext-link>. <comment>1230</comment></mixed-citation></ref>
<ref id="ref-288"><label>288.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Erichson</surname>, <given-names>N. B.</given-names></string-name>, <string-name><surname>Muehlebach</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Mahoney</surname>, <given-names>M. W.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Physics-informed Autoencoders for Lyapunov-stable Fluid Flow Prediction</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1905.10866">arXiv:1905.10866</ext-link>. <comment>1230</comment></mixed-citation></ref>
<ref id="ref-289"><label>289.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Raissi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Perdikaris</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Karniadakis</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Physics informed learning machine</article-title>. <comment>US Patent 10,963,540, Mar 30</comment>. <ext-link ext-link-type="uri" xlink:href="https://patents.google.com/patent/US10963540B2/en">Google Patents</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://patentimages.storage.googleapis.com/da/91/71/d365de8d750b7d/US10963540.pdf">pdf</ext-link>. <comment>1230, 1231</comment></mixed-citation></ref>
<ref id="ref-290"><label>290.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lagaris</surname>, <given-names>I. E.</given-names></string-name>, <string-name><surname>Likas</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Fotiadis</surname>, <given-names>D. I.</given-names></string-name></person-group> (<year>1998</year>). <article-title>Artificial neural networks for solving ordinary and partial differential equations</article-title>. <source>IEEE Transactions on Neural Networks</source>, <volume>9</volume>(<issue>5</issue>), <fpage>987</fpage>&#x2013;<lpage>1000</lpage>. <ext-link ext-link-type="uri" xlink:href="https://ieeexplore.ieee.org/document/712178">Original website</ext-link>. <comment>1230</comment>; <pub-id pub-id-type="pmid">18255782</pub-id></mixed-citation></ref>
<ref id="ref-291"><label>291.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lagaris</surname>, <given-names>I. E.</given-names></string-name>, <string-name><surname>Likas</surname>, <given-names>A. C.</given-names></string-name>, <string-name><surname>Papageorgiou</surname>, <given-names>D. G.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Neural-network methods for boundary value problems with irregular boundaries</article-title>. <source>IEEE Transactions on Neural Networks</source>, <volume>11</volume>(<issue>5</issue>), <fpage>1041</fpage>&#x2013;<lpage>1049</lpage>. <ext-link ext-link-type="uri" xlink:href="https://ieeexplore.ieee.org/document/870037">Original website</ext-link>. <comment>1230</comment>; <pub-id pub-id-type="pmid">18249832</pub-id></mixed-citation></ref>
<ref id="ref-292"><label>292.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Raissi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Perdikaris</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Karniadakis</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1711.10561">arXiv:1711.10561</ext-link>. <comment>1230</comment></mixed-citation></ref>
<ref id="ref-293"><label>293.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Raissi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Perdikaris</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Karniadakis</surname>, <given-names>G. E.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1711.10566">arXiv:1711.10566</ext-link>. <comment>1230</comment></mixed-citation></ref>
<ref id="ref-294"><label>294.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Gupta</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Agrawal</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gopalakrishnan</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Narayanan</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Deep learning with limited numerical precision</article-title>. In <source>International Conference on Machine Learning</source>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1502.02551">arXiv:1502.02551</ext-link>. <comment>1239</comment></mixed-citation></ref>
<ref id="ref-295"><label>295.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Courbariaux</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Hubara</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Soudry</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>El-Yaniv</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2016</year>). <article-title>Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1602.02830">arXiv:1602.02830</ext-link>. <comment>1239</comment></mixed-citation></ref>
<ref id="ref-296"><label>296.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>De Sa</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Feldman</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>R&#x00E9;</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Olukotun</surname>, <given-names>K.</given-names></string-name></person-group> <article-title>Understanding and optimizing asynchronous low-precision stochastic gradient descent</article-title>. In <conf-name>Proceedings of the 44th Annual International Symposium on Computer Architecture</conf-name>. <comment>1239</comment></mixed-citation></ref>
<ref id="ref-297"><label>297.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Borja</surname>, <given-names>R. I.</given-names></string-name></person-group> (<year>2000</year>). <article-title>A finite element model for strain localization analysis of strongly discontinuous fields based on standard Galerkin approximation</article-title>. <source>Computer Methods in Applied Mechanics and Engineering</source>, <volume>190</volume>(<issue>11-12</issue>), <fpage>1529</fpage>&#x2013;<lpage>1549</lpage>. <comment>1247, 1250, 1251, 1252</comment></mixed-citation></ref>
<ref id="ref-298"><label>298.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sibson</surname>, <given-names>R. H.</given-names></string-name></person-group> (<year>1985</year>). <article-title>A note on fault reactivation</article-title>. <source>Journal of Structural Geology</source>, <volume>7</volume>(<issue>6</issue>), <fpage>751</fpage>&#x2013;<lpage>754</lpage>. <comment>1248</comment></mixed-citation></ref>
<ref id="ref-299"><label>299.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Passel&#x00E8;gue</surname>, <given-names>F. X.</given-names></string-name>, <string-name><surname>Brantut</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Mitchell</surname>, <given-names>T. M.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Fault reactivation by fluid injection: Controls from stress state and injection rate</article-title>. <source>Geophysical Research Letters</source>, <volume>45</volume>(<issue>23</issue>), <fpage>12,837</fpage>&#x2013;<lpage>12,846</lpage>. <comment>1249</comment></mixed-citation></ref>
<ref id="ref-300"><label>300.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Kuchment</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Even if injection of fracking wastewater stops, quakes won&#x2019;t</article-title>. <source>Scientific American</source>. <comment>Sep 9</comment>. <comment>1249</comment></mixed-citation></ref>
<ref id="ref-301"><label>301.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Park</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Paulino</surname>, <given-names>G. H.</given-names></string-name></person-group> (<year>2011</year>). <article-title>Cohesive zone models: a critical review of traction-separation relationships across fracture surfaces</article-title>. <source>Applied Mechanics Reviews</source>, <volume>64</volume>(<issue>6</issue>). <comment>1249</comment></mixed-citation></ref>
<ref id="ref-302"><label>302.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2007</year>). <article-title>An accurate elasto-plastic frictional tangential force&#x2013;displacement model for granular-flow simulations: Displacement-driven formulation</article-title>. <source>Journal of Computational Physics</source>, <volume>225</volume>(<issue>1</issue>), <fpage>730</fpage>&#x2013;<lpage>752</lpage>. <comment>1249, 1252</comment></mixed-citation></ref>
<ref id="ref-303"><label>303.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name></person-group> (<year>1999</year>). <article-title>An accurate and efficient tangential force&#x2013;displacement model for elastic frictional contact in particle-flow simulations</article-title>. <source>Mechanics of Materials</source>, <volume>31</volume>(<issue>4</issue>), <fpage>235</fpage>&#x2013;<lpage>269</lpage>. <comment>1252</comment></mixed-citation></ref>
<ref id="ref-304"><label>304.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Lesburg</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2001</year>). <article-title>Normal and tangential force&#x2013;displacement relations for frictional elasto-plastic contact of spheres</article-title>. <source>International Journal of Solids and Structures</source>, <volume>38</volume>(<issue>36-37</issue>), <fpage>6455</fpage>&#x2013;<lpage>6489</lpage>. <comment>1252</comment></mixed-citation></ref>
<ref id="ref-305"><label>305.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Haghighat</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Raissi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Moure</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gomez</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Juanes</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2021</year>). <article-title>A physics-informed deep learning framework for inversion and surrogate modeling in solid mechanics</article-title>. <source>Computer Methods in Applied Mechanics and Engineering</source>, <volume>379</volume>, <fpage>113741</fpage>. <comment>1252</comment></mixed-citation></ref>
<ref id="ref-306"><label>306.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhai</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2007</year>). <article-title>Analysis of power magnetic components with nonlinear static hysteresis: Proper orthogonal decomposition and model reduction</article-title>. <source>IEEE Transactions on Magnetics</source>, <volume>43</volume>(<issue>5</issue>), <fpage>1888</fpage>&#x2013;<lpage>1897</lpage>. <comment>1252, 1256, 1260</comment></mixed-citation></ref>
<ref id="ref-307"><label>307.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Benner</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Gugercin</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Willcox</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2015</year>). <article-title>A Survey of Projection-Based Model Reduction Methods for Parametric Dynamical Systems</article-title>. <source>SIAM Review</source>, <volume>57</volume>(<issue>4</issue>), <fpage>483</fpage>&#x2013;<lpage>531</lpage>. <comment>1261, 1263, 1271</comment></mixed-citation></ref>
<ref id="ref-308"><label>308.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Greif</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Urban</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Decay of the Kolmogorov N-width for wave problems</article-title>. <source>Applied Mathematics Letters</source>, <volume>96</volume>, <fpage>216</fpage>&#x2013;<lpage>222</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.sciencedirect.com/science/article/pii/S0893965919301983">Original website</ext-link>. <comment>1261</comment></mixed-citation></ref>
<ref id="ref-309"><label>309.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Craig</surname>, <given-names>R. R.</given-names></string-name>, <string-name><surname>Bampton</surname>, <given-names>M. C. C.</given-names></string-name></person-group> (<year>1968</year>). <article-title>Coupling of substructures for dynamic analyses</article-title>. <source>AIAA Journal</source>, <volume>6</volume>(<issue>7</issue>), <fpage>1313</fpage>&#x2013;<lpage>1319</lpage>. <comment>1263</comment></mixed-citation></ref>
<ref id="ref-310"><label>310.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chaturantabut</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Sorensen</surname>, <given-names>D. C.</given-names></string-name></person-group> (<year>2010</year>). <article-title>Nonlinear Model Reduction via Discrete Empirical Interpolation</article-title>. <source>SIAM Journal on Scientific Computing</source>, <volume>32</volume>(<issue>5</issue>), <fpage>2737</fpage>&#x2013;<lpage>2764</lpage>. <comment>1267, 1268</comment></mixed-citation></ref>
<ref id="ref-311"><label>311.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Carlberg</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Bou-Mosleh</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Farhat</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2011</year>). <article-title>Efficient non-linear model reduction via a least-squares Petrov-Galerkin projection and compressive tensor approximations</article-title>. <source>International Journal for Numerical Methods in Engineering</source>, <volume>86</volume>(<issue>2</issue>), <fpage>155</fpage>&#x2013;<lpage>181</lpage>. <comment>1267, 1268</comment></mixed-citation></ref>
<ref id="ref-312"><label>312.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Choi</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Coombs</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Anderson</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2020</year>). <article-title>SNS: A Solution-Based Nonlinear Subspace Method for Time-Dependent Model Order Reduction</article-title>. <source>SIAM Journal on Scientific Computing</source>, <volume>42</volume>(<issue>2</issue>), <fpage>A1116</fpage>&#x2013;<lpage>A1146</lpage>. <comment>1267</comment></mixed-citation></ref>
<ref id="ref-313"><label>313.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Everson</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Sirovich</surname>, <given-names>L.</given-names></string-name></person-group> (<year>1995</year>). <article-title>Karhunen-Lo&#x00E8;ve procedure for gappy data</article-title>. <source>Journal of the Optical Society of America A</source>, <volume>12</volume>(<issue>8</issue>), <fpage>1657</fpage>. <comment>1267</comment></mixed-citation></ref>
<ref id="ref-314"><label>314.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Carlberg</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Farhat</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Cortial</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Amsallem</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2013</year>). <article-title>The GNAT method for nonlinear model reduction: Effective implementation and application to computational fluid dynamics and turbulent flows</article-title>. <source>Journal of Computational Physics</source>, <volume>242</volume>, <fpage>623</fpage>&#x2013;<lpage>647</lpage>. <comment>1268</comment></mixed-citation></ref>
<ref id="ref-315"><label>315.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Tiso</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Rixen</surname>, <given-names>D. J.</given-names></string-name></person-group> (<year>2013</year>). <article-title>Discrete empirical interpolation method for finite element structural dynamics</article-title>. <comment>1271</comment></mixed-citation></ref>
<ref id="ref-316"><label>316.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Brooks</surname>, <given-names>A. N.</given-names></string-name>, <string-name><surname>Hughes</surname>, <given-names>T. J.</given-names></string-name></person-group> (<year>1982</year>). <article-title>Streamline upwind/Petrov-Galerkin formulations for convection dominated flows with particular emphasis on the incompressible Navier-Stokes equations</article-title>. <source>Computer Methods in Applied Mechanics and Engineering</source>, <volume>32</volume>(<issue>1</issue>), <fpage>199</fpage>&#x2013;<lpage>259</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.sciencedirect.com/science/article/pii/0045782582900718">Original website</ext-link>. <comment>1272</comment></mixed-citation></ref>
<ref id="ref-317"><label>317.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kochkov</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Smith</surname>, <given-names>J. A.</given-names></string-name>, <string-name><surname>Alieva</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Brenner</surname>, <given-names>M. P.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Machine learning&#x2013;accelerated computational fluid dynamics</article-title>. <source>Proceedings of the National Academy of Sciences</source>, <volume>118</volume>(<issue>21</issue>), <fpage>e2101784118</fpage>. <comment>1276</comment></mixed-citation></ref>
<ref id="ref-318"><label>318.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bishara</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Xie</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>W. K.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2023</year>). <article-title>A state-of-the-art review on machine learning-based multiscale modeling, simulation, homogenization and design of materials</article-title>. <source>Archives of Computational Methods in Engineering</source>, <volume>30</volume>(<issue>1</issue>), <fpage>191</fpage>&#x2013;<lpage>222</lpage>. <comment>1277</comment></mixed-citation></ref>
<ref id="ref-319"><label>319.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rosenblatt</surname>, <given-names>F.</given-names></string-name></person-group> (<year>1960</year>). <article-title>Perceptron simulation experiments</article-title>. <source>Proceedings of the Institute of Radio Engineers</source>, <volume>48</volume>(<issue>3</issue>), <fpage>301</fpage>&#x2013;<lpage>309</lpage>. <comment>1278</comment></mixed-citation></ref>
<ref id="ref-320"><label>320.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Block</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Knight</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Rosenblatt</surname>, <given-names>F.</given-names></string-name></person-group> (<year>1962b</year>). <article-title>Analysis of a 4-layer series-coupled perceptron. II</article-title>. <source>Reviews of Modern Physics</source>, <volume>34</volume>(<issue>1</issue>), <fpage>135</fpage>&#x2013;<lpage>142</lpage>. <comment>1278</comment></mixed-citation></ref>
<ref id="ref-321"><label>321.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Gopnik</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2019</year>). <article-title>The ultimate learning machines</article-title>. <source>The Wall Street Journal</source>, <comment>Oct 11.</comment> <comment>1283, 1309</comment></mixed-citation></ref>
<ref id="ref-322"><label>322.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hodgkin</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Huxley</surname>, <given-names>A.</given-names></string-name></person-group> (<year>1952</year>). <article-title>A quantitative description of membrane current and its application to conduction and excitation in nerve</article-title>. <source>Journal of Physiology</source>, <volume>117</volume>(<issue>4</issue>), <fpage>500</fpage>&#x2013;<lpage>544</lpage>. <comment>1286, 1288</comment>; <pub-id pub-id-type="pmid">12991237</pub-id></mixed-citation></ref>
<ref id="ref-323"><label>323.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Dimirovski</surname>, <given-names>G. M.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Delay and recurrent neural networks: Computational cybernetics of systems biology?</article-title> In <source>2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC)</source>. <publisher-name>IEEE</publisher-name>. <comment>1287</comment></mixed-citation></ref>
<ref id="ref-324"><label>324.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Gherardi</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Souty-Grosset</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Vogt</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Dieguez-Uribeondo</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Crandall</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2009</year>). <chapter-title>Infraorder Astacidea Latreille, 1802 p.p.: The freshwater crayfish</chapter-title>. In <person-group person-group-type="editor"><string-name><given-names>F.</given-names> <surname>Schram</surname></string-name>, <string-name><given-names>C.</given-names> <surname>von Vaupel Klein</surname></string-name></person-group>, <comment>editors</comment>, <source>Treatise on Zoology - Anatomy, Taxonomy, Biology. The Crustacea, Volume 9 Part A</source>, <comment>chapter 67</comment>. <publisher-loc>Leiden, Netherlands</publisher-loc>: <publisher-name>Brill</publisher-name>, <fpage>269</fpage>&#x2013;<lpage>423</lpage>. <comment>1288</comment></mixed-citation></ref>
<ref id="ref-325"><label>325.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Han</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Moraga</surname>, <given-names>C.</given-names></string-name></person-group> <article-title>The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning</article-title>. In <source>IWANN &#x2019;96 Proceedings of the International Workshop on Artificial Neural Networks: From Natural to Artificial Neural Computation, Jun 07-09</source>. <comment>1288</comment></mixed-citation></ref>
<ref id="ref-326"><label>326.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Furshpan</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Potter</surname>, <given-names>D.</given-names></string-name></person-group> (<year>1959a</year>). <article-title>Transmission at the giant motor synapses of the crayfish</article-title>. <source>Journal of Physiology-London</source>, <volume>145</volume>(<issue>2</issue>), <fpage>289</fpage>&#x2013;<lpage>325</lpage>. <comment>1289, 1290</comment></mixed-citation></ref>
<ref id="ref-327"><label>327.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Bush</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Sejnowski</surname>, <given-names>T.</given-names></string-name></person-group> (<year>1995</year>). <source>The Cortical Neuron</source>. <publisher-name>Oxford University Press</publisher-name>. <comment>1289</comment></mixed-citation></ref>
<ref id="ref-328"><label>328.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Werbos</surname>, <given-names>P.</given-names></string-name></person-group> (<year>1990</year>). <article-title>Backpropagation through time - what it does and how to do it</article-title>. <source>Proceedings of the IEEE</source>, <volume>78</volume>(<issue>10</issue>), <fpage>1550</fpage>&#x2013;<lpage>1560</lpage>. <comment>1291, 1293</comment></mixed-citation></ref>
<ref id="ref-329"><label>329.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Baydin</surname>, <given-names>A. G.</given-names></string-name>, <string-name><surname>Pearlmutter</surname>, <given-names>B. A.</given-names></string-name>, <string-name><surname>Radul</surname>, <given-names>A. A.</given-names></string-name>, <string-name><surname>Siskind</surname>, <given-names>J. M.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Automatic Differentiation in Machine Learning: a Survey</article-title>. <source>Journal of Machine Learning Research</source>, <volume>18</volume>. <comment>1291, 1293</comment></mixed-citation></ref>
<ref id="ref-330"><label>330.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Werbos</surname>, <given-names>P. J.</given-names></string-name>, <string-name><surname>Davis</surname>, <given-names>J. J. J.</given-names></string-name></person-group> (<year>2016</year>). <article-title>Regular Cycles of Forward and Backward Signal Propagation in Prefrontal Cortex and in Consciousness</article-title>. <source>Frontiers in Systems Neuroscience</source>, <volume>10</volume>. <comment>1293</comment></mixed-citation></ref>
<ref id="ref-331"><label>331.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Metz</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Turing Award Won by 3 Pioneers in Artificial Intelligence</article-title>. <source>New York Times</source>, (<italic>Mar 27</italic>). <comment>1293</comment></mixed-citation></ref>
<ref id="ref-332"><label>332.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Topol</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2019</year>). <article-title>The A.I. Diet</article-title>. <source>New York Times</source>, (<italic>Mar 02</italic>). <comment>1294</comment></mixed-citation></ref>
<ref id="ref-333"><label>333.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Laguarta</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hueto</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Subirana</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Covid-19 artificial intelligence diagnosis using only cough recordings</article-title>. <source>IEEE Open Journal of Engineering in Medicine and Biology</source>. <comment>1295, 1296</comment></mixed-citation></ref>
<ref id="ref-334"><label>334.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Heaven</surname>, <given-names>W.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Hundreds of AI tools have been built to catch covid. None of them helped</article-title>. <source>MIT Technology Review</source>. <comment>July 30</comment>. <comment>1294, 1295, 1296</comment></mixed-citation></ref>
<ref id="ref-335"><label>335.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wynants</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Van Calster</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Collins</surname>, <given-names>G. S.</given-names></string-name>, <string-name><surname>Riley</surname>, <given-names>R. D.</given-names></string-name>, <string-name><surname>Heinze</surname>, <given-names>G.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal</article-title>. <source>BMJ</source>, <volume>369</volume>. <comment>1294</comment></mixed-citation></ref>
<ref id="ref-336"><label>336.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Roberts</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Driggs</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Thorpe</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Gilbey</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Yeung</surname>, <given-names>M.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans</article-title>. <source>Nature Machine Intelligence</source>, <volume>3</volume>(<issue>3</issue>), <fpage>199</fpage>&#x2013;<lpage>217</lpage>. <comment>1294</comment></mixed-citation></ref>
<ref id="ref-337"><label>337.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Moons</surname>, <given-names>K. G.</given-names></string-name>, <string-name><surname>de Groot</surname>, <given-names>J. A.</given-names></string-name>, <string-name><surname>Bouwmeester</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Vergouwe</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Mallett</surname>, <given-names>S.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2014</year>). <article-title>Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist</article-title>. <source>PLoS Medicine</source>, <volume>11</volume>(<issue>10</issue>), <fpage>e1001744</fpage>. <comment>1294</comment>; <pub-id pub-id-type="pmid">25314315</pub-id></mixed-citation></ref>
<ref id="ref-338"><label>338.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wolff</surname>, <given-names>R. F.</given-names></string-name>, <string-name><surname>Moons</surname>, <given-names>K. G.</given-names></string-name>, <string-name><surname>Riley</surname>, <given-names>R. D.</given-names></string-name>, <string-name><surname>Whiting</surname>, <given-names>P. F.</given-names></string-name>, <string-name><surname>Westwood</surname>, <given-names>M.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2019</year>). <article-title>PROBAST: a tool to assess the risk of bias and applicability of prediction model studies</article-title>. <source>Annals of Internal Medicine</source>, <volume>170</volume>(<issue>1</issue>), <fpage>51</fpage>&#x2013;<lpage>58</lpage>. <comment>1294</comment>; <pub-id pub-id-type="pmid">30596875</pub-id></mixed-citation></ref>
<ref id="ref-339"><label>339.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Matei</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2020</year>). <article-title>An app could catch 98.5% of all Covid-19 infections. Why isn't it available?</article-title>. <source>The Guardian</source>, (<italic>Dec 16</italic>). <comment>1296</comment></mixed-citation></ref>
<ref id="ref-340"><label>340.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Coppock</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Jones</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Kiskin</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Schuller</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Covid-19 detection from audio: seven grains of salt</article-title>. <source>The Lancet Digital Health</source>, <volume>3</volume>(<issue>9</issue>), <fpage>e537</fpage>&#x2013;<lpage>e538</lpage>. <comment>1296</comment>; <pub-id pub-id-type="pmid">34303644</pub-id></mixed-citation></ref>
<ref id="ref-341"><label>341.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y. D.</given-names></string-name>, <string-name><surname>Lu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Lu</surname>, <given-names>Z.</given-names></string-name></person-group> (<year>2022</year>). <article-title>A Survey on Machine Learning in COVID-19 Diagnosis</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>130</volume>(<issue>1</issue>), <fpage>23</fpage>&#x2013;<lpage>71</lpage>. <comment>1296, 1297</comment></mixed-citation></ref>
<ref id="ref-342"><label>342.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Deng</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Shao</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Deep Learning Applications for COVID-19 Analysis: A State-of-the-Art Survey</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>129</volume>(<issue>1</issue>), <fpage>65</fpage>&#x2013;<lpage>98</lpage>. <comment>1296, 1297</comment></mixed-citation></ref>
<ref id="ref-343"><label>343.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xie</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Lv</surname>, <given-names>Z.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Multi-Disease Prediction Based on Deep Learning: A Survey</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>128</volume>(<issue>2</issue>), <fpage>489</fpage>&#x2013;<lpage>522</lpage>. <comment>1296, 1297</comment></mixed-citation></ref>
<ref id="ref-344"><label>344.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Gong</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Gao</surname>, <given-names>Z.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Predicting Genotype Information Related to COVID-19 for Molecular Mechanism Based on Computational Methods</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>129</volume>(<issue>1</issue>), <fpage>31</fpage>&#x2013;<lpage>45</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-345"><label>345.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Monajjemi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Esmkhani</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Mollaamin</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Shahriari</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Prediction of Proteins Associated with COVID-19 Based Ligand Designing and Molecular Modeling</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>3</issue>), <fpage>907</fpage>&#x2013;<lpage>926</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-346"><label>346.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Attaallah</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ahmad</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Seh</surname>, <given-names>A. H.</given-names></string-name>, <string-name><surname>Agrawal</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>R.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Estimating the Impact of COVID-19 Pandemic on the Research Community in the Kingdom of Saudi Arabia</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>126</volume>(<issue>1</issue>), <fpage>419</fpage>&#x2013;<lpage>436</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-347"><label>347.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Gupta</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Jain</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Gupta</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Jain</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Real-Time Analysis of COVID-19 Pandemic on Most Populated Countries Worldwide</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>3</issue>), <fpage>943</fpage>&#x2013;<lpage>965</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-348"><label>348.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Areepong</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Sunthornwat</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Predictive Models for Cumulative Confirmed COVID-19 Cases by Day in Southeast Asia</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>3</issue>), <fpage>927</fpage>&#x2013;<lpage>942</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-349"><label>349.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Singh</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Bajpai</surname>, <given-names>M. K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>SEIHCRD Model for COVID-19 Spread Scenarios, Disease Predictions and Estimates the Basic Reproduction Number, Case Fatality Rate, Hospital, and ICU Beds Requirement</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>3</issue>), <fpage>991</fpage>&#x2013;<lpage>1031</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-350"><label>350.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Akyol</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Growing and Pruning Based Deep Neural Networks Modeling for Effective Parkinson&#x2019;s Disease Diagnosis</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>122</volume>(<issue>2</issue>), <fpage>619</fpage>&#x2013;<lpage>632</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-351"><label>351.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hemalakshmi</surname>, <given-names>G. R.</given-names></string-name>, <string-name><surname>Santhi</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Mani</surname>, <given-names>V. R. S.</given-names></string-name>, <string-name><surname>Geetha</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Prakash</surname>, <given-names>N. B.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Deep Residual Network Based on Image Priors for Single Image Super Resolution in FFA Images</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>1</issue>), <fpage>125</fpage>&#x2013;<lpage>143</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-352"><label>352.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vu-Quoc</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Zhai</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Ngo</surname>, <given-names>K. D. T.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Model reduction by generalized Falk method for efficient field-circuit simulations</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>129</volume>(<issue>3</issue>), <fpage>1441</fpage>&#x2013;<lpage>1486</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.techscience.com/CMES/v129n3/45696">DOI: 10.32604/cmes.2021.016784</ext-link>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-353"><label>353.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Saha</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Mojumder</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Al Amin</surname>, <given-names>A.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Reduced Order Machine Learning Finite Element Methods: Concept, Implementation, and Future Applications</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.32604/cmes.2021.017719">DOI: 10.32604/cmes.2021.017719</ext-link>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-354"><label>354.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Deng</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Shao</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Hu</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Jiang</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Jiang</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Wind Power Forecasting Methods Based on Deep Learning: A Survey</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>122</volume>(<issue>1</issue>), <fpage>273</fpage>&#x2013;<lpage>301</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-355"><label>355.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Zhao</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Xi</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>X.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>Data Augmentation Technology Driven By Image Style Transfer in Self-Driving Car Based on End-to-End Learning</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>122</volume>(<issue>2</issue>), <fpage>593</fpage>&#x2013;<lpage>617</lpage>. <comment>1297</comment></mixed-citation></ref>
<ref id="ref-356"><label>356.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sethi</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kathuria</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Kaushik</surname>, <given-names>T.</given-names></string-name></person-group> (<year>2021</year>). <article-title>A Real-Time Integrated Face Mask Detector to Curtail Spread of Coronavirus</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>127</volume>(<issue>2</issue>), <fpage>389</fpage>&#x2013;<lpage>409</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-357"><label>357.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Luo</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Gong</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>An Improved Data-Driven Topology Optimization Method Using Feature Pyramid Networks with Physical Constraints</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>128</volume>(<issue>3</issue>), <fpage>823</fpage>&#x2013;<lpage>848</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-358"><label>358.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Qu</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Di</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Feng</surname>, <given-names>Y. T.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Zhao</surname>, <given-names>T.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Deep Learning Predicts Stress-Strain Relations of Granular Materials Based on Triaxial Testing Data</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>128</volume>(<issue>1</issue>), <fpage>129</fpage>&#x2013;<lpage>144</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-359"><label>359.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>X.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Deep Learning-Based Surrogate Model for Flight Load Analysis</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>128</volume>(<issue>2</issue>), <fpage>605</fpage>&#x2013;<lpage>621</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-360"><label>360.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y. D.</given-names></string-name>, <string-name><surname>Jiang</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Yan</surname>, <given-names>H.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Classification of Domestic Refuse in Medical Institutions Based on Transfer Learning and Convolutional Neural Network</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>127</volume>(<issue>2</issue>), <fpage>599</fpage>&#x2013;<lpage>620</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-361"><label>361.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yang</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2020</year>). <article-title>PDNet: A Convolutional Neural Network Has Potential to be Deployed on Small Intelligent Devices for Arrhythmia Diagnosis</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>1</issue>), <fpage>365</fpage>&#x2013;<lpage>382</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-362"><label>362.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yin</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Dynamic Pricing Model of E-Commerce Platforms Based on Deep Reinforcement Learning</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>127</volume>(<issue>1</issue>), <fpage>291</fpage>&#x2013;<lpage>307</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-363"><label>363.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Ran</surname>, <given-names>X.</given-names></string-name></person-group> (<year>2021</year>). <article-title>A Step-Based Deep Learning Approach for Network Intrusion Detection</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>128</volume>(<issue>3</issue>), <fpage>1231</fpage>&#x2013;<lpage>1245</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-364"><label>364.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Park</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>J. H.</given-names></string-name>, <string-name><surname>Bang</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2021</year>). <article-title>PotholeEye plus: Deep-Learning Based Pavement Distress Detection System toward Smart Maintenance</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>127</volume>(<issue>3</issue>), <fpage>965</fpage>&#x2013;<lpage>976</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-365"><label>365.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mu</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Zhao</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Qiu</surname>, <given-names>J.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Traffic Flow Statistics Method Based on Deep Learning and Multi-Feature Fusion</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>129</volume>(<issue>2</issue>), <fpage>465</fpage>&#x2013;<lpage>483</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-366"><label>366.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Peng</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>A Multi-View Gait Recognition Method Using Deep Convolutional Neural Network and Channel Attention Mechanism</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>1</issue>), <fpage>345</fpage>&#x2013;<lpage>363</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-367"><label>367.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shi</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Zheng</surname>, <given-names>H.</given-names></string-name></person-group> (<year>2021</year>). <article-title>A Mortality Risk Assessment Approach on ICU Patients Clinical Medication Events Using Deep Learning</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>128</volume>(<issue>1</issue>), <fpage>161</fpage>&#x2013;<lpage>181</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-368"><label>368.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bian</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Stereo Matching Method Based on Space-Aware Network Model</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>127</volume>(<issue>1</issue>), <fpage>175</fpage>&#x2013;<lpage>189</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-369"><label>369.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kong</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Combining Trend-Based Loss with Neural Network for Air Quality Forecasting in Internet of Things</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>2</issue>), <fpage>849</fpage>&#x2013;<lpage>863</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-370"><label>370.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jothiramalingam</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Jude</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Hemanth</surname>, <given-names>D. J.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Review of Computational Techniques for the Analysis of Abnormal Patterns of ECG Signal Provoked by Cardiac Disease</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>128</volume>(<issue>3</issue>), <fpage>875</fpage>&#x2013;<lpage>906</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-371"><label>371.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Xin</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>He</surname>, <given-names>Q.</given-names></string-name></person-group> (<year>2021</year>). <article-title>An Improved Algorithm for the Detection of Fastening Targets Based on Machine Vision</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>128</volume>(<issue>2</issue>), <fpage>779</fpage>&#x2013;<lpage>802</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-372"><label>372.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dong</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Fang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>J.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Intelligent Segmentation and Measurement Model for Asphalt Road Cracks Based on Modified Mask R-CNN Algorithm</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>128</volume>(<issue>2</issue>), <fpage>541</fpage>&#x2013;<lpage>564</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-373"><label>373.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Luo</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Shen</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Peng</surname>, <given-names>Q.</given-names></string-name></person-group> (<year>2021</year>). <article-title>A Novel Named Entity Recognition Scheme for Steel E-Commerce Platforms Using a Lite BERT</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>129</volume>(<issue>1</issue>), <fpage>47</fpage>&#x2013;<lpage>63</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-374"><label>374.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Q.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Short-Term Traffic Flow Prediction Based on LSTM-XGBoost Combination Model</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>1</issue>), <fpage>95</fpage>&#x2013;<lpage>109</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-375"><label>375.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>H.</given-names></string-name></person-group> (<year>2020</year>). <article-title>An Emotion Analysis Method Using Multi-Channel Convolution Neural Network in Social Networks</article-title>. <source>CMES-Computer Modeling in Engineering &#x0026; Sciences</source>, <volume>125</volume>(<issue>1</issue>), <fpage>281</fpage>&#x2013;<lpage>297</lpage>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-376"><label>376.</label><mixed-citation publication-type="web"><article-title>Safety Test Reveals Tesla&#x2019;s Full Self-Driving Software Repeatedly Hits Child-Sized Mannequin</article-title>. <comment>The Dawn Project, 2022.08.09</comment>, <ext-link ext-link-type="uri" xlink:href="https://dawnproject.com/safety-test-reveals-teslas-full-self-driving-software-repeatedly-hits-child-sized-mannequin/">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220817190136/https://dawnproject.com/safety-test-reveals-teslas-full-self-driving-software-repeatedly-hits-child-sized-mannequin/">Internet archived on 2022.08.17</ext-link>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-377"><label>377.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Helmore</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Tesla&#x2019;s self-driving technology fails to detect children in the road, tests find</article-title>. <source>The Guardian</source>, (<italic>Aug 09</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.theguardian.com/technology/2022/aug/09/tesla-self-driving-technology-safety-children">Original website</ext-link>. <comment>1298, 1299, 1300</comment></mixed-citation></ref>
<ref id="ref-378"><label>378.</label><mixed-citation publication-type="web"><article-title>Does Tesla Full Self-Driving Beta really run over kids?</article-title> <comment>Whole Mars Catalog, 2022.08.14</comment>, <ext-link ext-link-type="uri" xlink:href="https://twitter.com/WholeMarsBlog/status/1558876752062976000">Tweet</ext-link>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-379"><label>379.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Roth</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2022</year>). <article-title>YouTube removes video that tests Tesla&#x2019;s Full Self-Driving beta against real kids</article-title>. <source>The Verge</source>, (<italic>Aug 20</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.theverge.com/2022/8/20/23314117/youtube-tesla-removes-video-full-self-driving-beta-real-kids">Original website</ext-link>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-380"><label>380.</label><mixed-citation publication-type="web"><article-title>Musk&#x2019;s Full Self-Driving @Tesla ruthlessly mowing down a child mannequin</article-title>. <comment>Dan O&#x2019;Dowd, The Dawn Project, 2022.08.15</comment>, <ext-link ext-link-type="uri" xlink:href="https://twitter.com/RealDanODowd/status/1559245760054575112">Tweet</ext-link>. <comment>1299</comment></mixed-citation></ref>
<ref id="ref-381"><label>381.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Hawkins</surname>, <given-names>A. J.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Tesla wants videos of its cars running over child-sized dummies taken down</article-title>. <source>The Verge</source>, (<italic>Aug 25</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.theverge.com/2022/8/20/23314117/youtube-tesla-removes-video-full-self-driving-beta-real-kids">Original website</ext-link>. <comment>1299</comment></mixed-citation></ref>
<ref id="ref-382"><label>382.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Metz</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Koeze</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Can Tesla Data Help Us Understand Car Crashes?</article-title> <source>New York Times</source>, (<italic>Aug 18</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/interactive/2022/08/18/business/tesla-crash-data.html">Original website</ext-link>. <comment>1300, 1301, 1302, 1303</comment></mixed-citation></ref>
<ref id="ref-383"><label>383.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Metz</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2017</year>). <article-title>A New Way for Machines to See, Taking Shape in Toronto</article-title>. <source>New York Times</source>, (<italic>Nov 28</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2017/11/28/technology/artificial-intelligence-research-toronto.html">Original website</ext-link>. <comment>1298</comment></mixed-citation></ref>
<ref id="ref-384"><label>384.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Dujmovic</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2021</year>). <article-title>You will not be traveling in a self-driving car anytime soon. Here&#x2019;s what the future will look like</article-title>. <source>MarketWatch</source>, (<italic>June 16 - Updated June 19</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.marketwatch.com/story/you-will-not-be-traveling-in-a-self-driving-car-anytime-soon-heres-what-the-future-will-look-like-11623866219">Original website</ext-link>. <comment>1298, 1300</comment></mixed-citation></ref>
<ref id="ref-385"><label>385.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Maresca</surname>, <given-names>T.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Hyundai&#x2019;s self-driving taxis roll out on the streets of South Korea</article-title>. <source>UPI</source>, (<italic>Jun 09</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.upi.com/Top_News/World-News/2022/06/09/hyundai-autonomous-driving-roboride-taxi/7541654783750/">Original website</ext-link>. <comment>1299</comment></mixed-citation></ref>
<ref id="ref-386"><label>386.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Kirkpatrick</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Still Waiting for Self-Driving Cars</article-title>. <source>Communications of the ACM</source>, (<italic>April</italic>). <ext-link ext-link-type="uri" xlink:href="https://cacm.acm.org/magazines/2022/4/259392-still-waiting-for-self-driving-cars/fulltext">Original website</ext-link>. <comment>1299, 1300</comment></mixed-citation></ref>
<ref id="ref-387"><label>387.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Bogna</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Is Your Car Autonomous? The 6 Levels of Self-Driving Explained</article-title>. <source>PC Magazine</source>, (<italic>June 14</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.pcmag.com/how-to/6-levels-of-autonomous-self-driving-explained">Original website</ext-link>. <comment>1299</comment></mixed-citation></ref>
<ref id="ref-388"><label>388.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Boudette</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Despite High Hopes, Self-Driving Cars Are &#x2018;Way in the Future&#x2019;</article-title>. <source>New York Times</source>, (<italic>Jul 17</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2019/07/17/business/self-driving-autonomous-cars.html">Original website</ext-link>. <comment>1300</comment></mixed-citation></ref>
<ref id="ref-389"><label>389.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Guinness</surname>, <given-names>H.</given-names></string-name></person-group> (<year>2022</year>). <article-title>What&#x2019;s going on with self-driving cars right now?</article-title> <source>Popular Science</source>, (<italic>May 28</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.popsci.com/technology/self-driving-car-companies-status/">Original website</ext-link>. <comment>1301</comment></mixed-citation></ref>
<ref id="ref-390"><label>390.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Smiley</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2022</year>). <article-title>&#x2018;I&#x2019;m the Operator&#x2019;: The Aftermath of a Self-Driving Tragedy</article-title>. <source>WIRED</source>, (<italic>Mar 8</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.wired.com/story/uber-self-driving-car-fatal-crash/">Original website</ext-link>. <comment>1301, 1302</comment></mixed-citation></ref>
<ref id="ref-391"><label>391.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Metz</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Griffith</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2020</year>). <article-title>This Was Supposed to Be the Year Driverless Cars Went Mainstream</article-title>. <source>New York Times</source>, (<italic>May 12 - Updated Sep 15</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2020/05/12/technology/self-driving-cars-coronavirus.html">Original website</ext-link>. <comment>1302</comment></mixed-citation></ref>
<ref id="ref-392"><label>392.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Metz</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>The Costly Pursuit of Self-Driving Cars Continues On. And On. And On</article-title>. <source>New York Times</source>, (<italic>May 24 - Updated Sep 15</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2021/05/24/technology/self-driving-cars-wait.html">Original website</ext-link>. <comment>1302</comment></mixed-citation></ref>
<ref id="ref-393"><label>393.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Mims</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Robot Boats Leave Autonomous Cars in Their Wake&#x2014;Unmanned ships don&#x2019;t have to worry about crowded roads. But crossing of the Atlantic is still a challenge</article-title>. <source>Wall Street Journal</source>, (<italic>Aug 29</italic>). <comment>1303</comment></mixed-citation></ref>
<ref id="ref-394"><label>394.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>O&#x2019;Brien</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Autonomous Mayflower reaches American shores&#x2014;in Canada</article-title>. <source>ABC News</source>, (<italic>Jun 05</italic>). <ext-link ext-link-type="uri" xlink:href="https://abcnews.go.com/Business/wireStory/autonomous-mayflower-reaches-american-shores-canada-85196228">Original website</ext-link>. <comment>1303, 1304</comment></mixed-citation></ref>
<ref id="ref-395"><label>395.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Mitchell</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Artificial intelligence hits the barrier of meaning</article-title>. <source>The New York Times</source>. <comment>1304</comment></mixed-citation></ref>
<ref id="ref-396"><label>396.</label><mixed-citation publication-type="web"><article-title>New Survey: Americans Think AI Is a Threat to Democracy, Will Become Smarter than Humans and Overtake Jobs, Yet Believe its Benefits Outweigh its Risks</article-title>. <publisher-name>Stevens Institute of Technology</publisher-name>, <year>2021</year> <month>Nov</month> <day>15</day>, <ext-link ext-link-type="uri" xlink:href="https://www.stevens.edu/news/new-survey-americans-think-ai-threat-democracy-will-become-smarter-humans-and-overtake-jobs-yet">Website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220308011755/https://www.stevens.edu/news/new-survey-americans-think-ai-threat-democracy-will-become-smarter-humans-and-overtake-jobs-yet">Internet archive</ext-link>. <comment>1306</comment></mixed-citation></ref>
<ref id="ref-397"><label>397.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Hu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Lyu</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Exposing GAN-generated faces using inconsistent corneal specular highlights</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2009.11924">arXiv:2009.11924</ext-link>. <comment>1306</comment></mixed-citation></ref>
<ref id="ref-398"><label>398.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Sencar</surname>, <given-names>H. T.</given-names></string-name>, <string-name><surname>Verdoliva</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Memon</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Multimedia forensics</article-title>. <comment>1306</comment></mixed-citation></ref>
<ref id="ref-399"><label>399.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Chesney</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Citron</surname>, <given-names>D. K.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security</article-title>. <source>107 California Law Review 1753</source>. <ext-link ext-link-type="uri" xlink:href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3213954">Original website</ext-link>. <comment>U of Texas Law, Public Law Research Paper No. 692. U of Maryland Legal Studies Research Paper No. 2018-21. 1306, 1307</comment></mixed-citation></ref>
<ref id="ref-400"><label>400.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Ingram</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Ward</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2019</year>). <article-title>How do you spot a deepfake? A clue hides within our voices, researchers say</article-title>. <source>NBC News</source>, (<italic>Dec 16</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nbcnews.com/tech/tech-news/little-tells-why-battle-against-deepfakes-2020-may-rely-verbal-n1102881">Original website</ext-link>. <comment>1306, 1307</comment></mixed-citation></ref>
<ref id="ref-401"><label>401.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Citron</surname>, <given-names>D.</given-names></string-name></person-group> <article-title>How deepfakes undermine truth and threaten democracy</article-title>. <comment>TEDSummit 2019</comment>. <ext-link ext-link-type="uri" xlink:href="https://www.ted.com/talks/danielle_citron_how_deepfakes_undermine_truth_and_threaten_democracy?language=en">Website</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-402"><label>402.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Manheim</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Kaplan</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Artificial Intelligence: Risks to Privacy and Democracy</article-title>. <source>Yale Journal of Law &#x0026; Technology</source>, <source>21</source>, <fpage>106</fpage>&#x2013;<lpage>188</lpage>. <ext-link ext-link-type="uri" xlink:href="https://yjolt.org/artificial-intelligence-risks-privacy-and-democracy">Original website</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-403"><label>403.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Hao</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Why AI is a threat to democracy&#x2014;and what we can do to stop it</article-title>. <source>MIT Technology Review</source>, (<italic>Feb 26</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.technologyreview.com/2019/02/26/66043/why-ai-is-a-threat-to-democracyand-what-we-can-do-to-stop-it/">Original website</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-404"><label>404.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Feldstein</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2019</year>). <article-title>How Artificial Intelligence Systems Could Threaten Democracy</article-title>. <source>Carnegie Endowment for International Peace</source>, (<italic>Apr 24</italic>). <ext-link ext-link-type="uri" xlink:href="https://carnegieendowment.org/2019/04/24/how-artificial-intelligence-systems-could-threaten-democracy-pub-78984">Original website</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-405"><label>405.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Pearce</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Beware the Privacy Violations in Artificial Intelligence Applications</article-title>. <source>ISACA Now Blog</source>, (<italic>May 28</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.isaca.org/resources/news-and-trends/isaca-now-blog/2021/beware-the-privacy-violations-in-artificial-intelligence-applications">Original website</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-406"><label>406.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Harwell</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Top AI researchers race to detect &#x2018;deepfake&#x2019; videos: &#x2018;We are outgunned&#x2019;</article-title>. <source>Washington Post</source>, (<italic>Jun 12</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.washingtonpost.com/technology/2019/06/12/top-ai-researchers-race-detect-deepfake-videos-we-are-outgunned/">Original website</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-407"><label>407.</label><mixed-citation publication-type="web"><article-title>Deepfake Detection Challenge: Identify videos with facial or voice manipulations</article-title>. <comment>2019-2020</comment>, <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/competitions/deepfake-detection-challenge/overview">Overview</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/competitions/deepfake-detection-challenge/leaderboard">Leaderboard</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-408"><label>408.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Groh</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Epstein</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Firestone</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Picard</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Deepfake detection by human crowds, machines, and machine-informed crowds</article-title>. <source>Proceedings of the National Academy of Sciences</source>, <volume>119</volume>(<issue>1</issue>), <fpage>e2110013119</fpage>. <ext-link ext-link-type="uri" xlink:href="https://www.pnas.org/doi/abs/10.1073/pnas.2110013119">Original website</ext-link>. <comment>1307, 1308</comment></mixed-citation></ref>
<ref id="ref-409"><label>409.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hintze</surname>, <given-names>J. L.</given-names></string-name>, <string-name><surname>Nelson</surname>, <given-names>R. D.</given-names></string-name></person-group> (<year>1998</year>). <article-title>Violin plots: a box plot-density trace synergism</article-title>. <source>The American Statistician</source>, <volume>52</volume>(<issue>2</issue>), <fpage>181</fpage>&#x2013;<lpage>184</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.jstor.org/stable/2685478?origin=JSTOR-pdf">JSTOR</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-410"><label>410.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lewinson</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Violin plots explained</article-title>. <source>Towards Data Science</source>, (<italic>Oct 21</italic>). <ext-link ext-link-type="uri" xlink:href="https://towardsdatascience.com/violin-plots-explained-fb1d115e023d">Original website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://github.com/erykml/medium_articles/blob/master/Statistics/violin_plots.ipynb">GitHub</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-411"><label>411.</label><mixed-citation publication-type="web"><article-title>Detect DeepFakes: How to counteract misinformation created by AI</article-title>. <comment>MIT Media Lab, project contact Matt Groh</comment>. <ext-link ext-link-type="uri" xlink:href="https://www.media.mit.edu/projects/detect-fakes/overview/">Website</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20220731111713/https://www.media.mit.edu/projects/detect-fakes/overview/">Internet archive</ext-link>. <comment>1307</comment></mixed-citation></ref>
<ref id="ref-412"><label>412.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Hill</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>The Secretive Company That Might End Privacy as We Know It</article-title>. <source>New York Times</source>, (<italic>Feb 10, Updated 2021 Nov 2</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html">Original website</ext-link>. <comment>1308</comment></mixed-citation></ref>
<ref id="ref-413"><label>413.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Morrison</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2020</year>). <article-title>The world&#x2019;s scariest facial recognition company is now linked to everybody from ICE to Macy&#x2019;s</article-title>. <source>Vox</source>, (<italic>Feb 28</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.vox.com/recode/2020/2/26/21154606/clearview-ai-data-breach">Original website</ext-link>. <comment>1308</comment></mixed-citation></ref>
<ref id="ref-414"><label>414.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Hill</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Wrongfully Accused by an Algorithm</article-title>. <source>New York Times</source>, (<italic>Jun 24, Updated Aug 03</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2020/06/24/technology/facial-recognition-arrest.html">Original website</ext-link>. <comment>1309</comment></mixed-citation></ref>
<ref id="ref-415"><label>415.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Raji</surname>, <given-names>I. D.</given-names></string-name>, <string-name><surname>Smart</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>White</surname>, <given-names>R. N.</given-names></string-name>, <string-name><surname>Mitchell</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Gebru</surname>, <given-names>T.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing</article-title>. In <source>Proceedings of the 2020 conference on fairness, accountability, and transparency</source>. <ext-link ext-link-type="uri" xlink:href="https://dl.acm.org/doi/10.1145/3351095.3372873">Original website</ext-link>. <comment>1309</comment></mixed-citation></ref>
<ref id="ref-416"><label>416.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Metz</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Who Is Making Sure the A.I. Machines Aren&#x2019;t Racist?</article-title> <source>New York Times</source>, (<italic>Mar 15</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2021/03/15/technology/artificial-intelligence-google-bias.html">Original website</ext-link>. <comment>1309</comment></mixed-citation></ref>
<ref id="ref-417"><label>417.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Samuel</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Why it&#x2019;s so damn hard to make AI fair and unbiased</article-title>. <source>Vox</source>, (<italic>Apr 19</italic>). <comment>Future Perfect</comment>, <ext-link ext-link-type="uri" xlink:href="https://www.vox.com/future-perfect/22916602/ai-bias-fairness-tradeoffs-artificial-intelligence">Original website</ext-link>. <comment>1309</comment></mixed-citation></ref>
<ref id="ref-418"><label>418.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Heilinger</surname>, <given-names>J. C.</given-names></string-name></person-group> (<year>2022</year>). <article-title>The Ethics of AI Ethics. A Constructive Critique</article-title>. <source>Philosophy &#x0026; Technology</source>, <volume>35</volume>(<issue>3</issue>), <fpage>1</fpage>&#x2013;<lpage>20</lpage>. <ext-link ext-link-type="uri" xlink:href="https://link.springer.com/article/10.1007/s13347-022-00557-9">Original website</ext-link>. <comment>1309</comment></mixed-citation></ref>
<ref id="ref-419"><label>419.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Heikkil&#x00E4;</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2022</year>). <article-title>The walls are closing in on Clearview AI as data watchdogs get tough</article-title>. <source>MIT Technology Review</source>, (<italic>May 24</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.technologyreview.com/2022/05/24/1052653/clearview-ai-data-privacy-uk/">Original website</ext-link>. <comment>1309</comment></mixed-citation></ref>
<ref id="ref-420"><label>420.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Metz</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Isaac</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Facebook&#x2019;s A.I. Whiz Now Faces the Task of Cleaning It Up. Sometimes That Brings Him to Tears</article-title>. <source>New York Times</source>, (<italic>May 17</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.nytimes.com/2019/05/17/technology/facebook-ai-schroepfer.html">Original website</ext-link>. <comment>1309</comment></mixed-citation></ref>
<ref id="ref-421"><label>421.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Haibe-Kains</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Adam</surname>, <given-names>G. A.</given-names></string-name>, <string-name><surname>Hosny</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Khodakarami</surname>, <given-names>F.</given-names></string-name>, <collab>Massive Analysis Quality Control (MAQC) Society Board of Directors</collab>, <etal>et al</etal>.</person-group> (<year>2020</year>). <article-title>Transparency and reproducibility in artificial intelligence</article-title>. <source>Nature</source>. <month>Oct</month> <day>14</day>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s41586-020-2766-y">doi.org/10.1038/s41586-020-2766-y</ext-link>. <comment>1310, 1311</comment></mixed-citation></ref>
<ref id="ref-422"><label>422.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Heaven</surname>, <given-names>W. D.</given-names></string-name></person-group> (<year>2020</year>). <article-title>AI is wrestling with a replication crisis</article-title>. <source>MIT Technology Review</source>, (<italic>Aug 29</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.technologyreview.com/2020/11/12/1011944/artificial-intelligence-replication-crisis-science-big-tech-google-deepmind-facebook-openai/">Original website</ext-link>. <comment>1310, 1311</comment></mixed-citation></ref>
<ref id="ref-423"><label>423.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>McKinney</surname>, <given-names>S. M.</given-names></string-name>, <string-name><surname>Sieniek</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Godbole</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Godwin</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Antropova</surname>, <given-names>N.</given-names></string-name>, <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>International evaluation of an AI system for breast cancer screening</article-title>. <source>Nature</source>, <volume>577</volume>(<issue>7788</issue>), <fpage>89</fpage>&#x2013;<lpage>94</lpage>. <comment>1311</comment></mixed-citation></ref>
<ref id="ref-424"><label>424.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Trevithick</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2020</year>). <article-title>General Atomics Avenger Drone Flew An Autonomous Air-To-Air Mission Using An AI Brain</article-title>. <source>The Drive</source>, (<italic>Dec 4</italic>). <ext-link ext-link-type="uri" xlink:href="https://www.thedrive.com/the-war-zone/37973/general-atomics-avenger-drone-flew-an-autonomous-air-to-air-mission-using-an-ai-brain">Original website</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20201207155027/https://www.thedrive.com/the-war-zone/37973/general-atomics-avenger-drone-flew-an-autonomous-air-to-air-mission-using-an-ai-brain">Internet Archive</ext-link>. <comment>1311</comment></mixed-citation></ref>
<ref id="ref-425"><label>425.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sepulchre</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Cybernetics [From the Editor]</article-title>. <source>IEEE Control Systems Magazine</source>, <volume>40</volume>(<issue>2</issue>), <fpage>3</fpage>&#x2013;<lpage>4</lpage>. <comment>1339</comment></mixed-citation></ref>
<ref id="ref-426"><label>426.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Copeland</surname>, <given-names>A.</given-names></string-name></person-group> (<year>1949</year>). <article-title>A cybernetic model of memory and recognition</article-title>. <source>Bulletin of the American Mathematical Society</source>, <volume>55</volume>(<issue>7</issue>), <fpage>698</fpage>. <comment>1340</comment></mixed-citation></ref>
<ref id="ref-427"><label>427.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chavalarias</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2020</year>). <article-title>From inert matter to the global society life as multi-level networks of processes</article-title>. <source>Philosophical Transactions of the Royal Society B-Biological Sciences</source>, <volume>375</volume>(<issue>1796</issue>). <comment>1340</comment></mixed-citation></ref>
<ref id="ref-428"><label>428.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Togashi</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Miyata</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Yamamoto</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2020</year>). <article-title>The first world championship in cybernetic building optimization</article-title>. <source>Journal of Building Performance Simulation</source>, <volume>13</volume>(<issue>3</issue>), <fpage>391</fpage>&#x2013;<lpage>408</lpage>. <comment>1340</comment></mixed-citation></ref>
<ref id="ref-429"><label>429.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Jube</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Labour and international accounting standards: A question of social justice</article-title>. <source>International Labour Review</source>. <comment>Early Access Date MAR 2020</comment>. <comment>1340</comment></mixed-citation></ref>
<ref id="ref-430"><label>430.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>McCulloch</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Pitts</surname>, <given-names>W.</given-names></string-name></person-group> (<year>1943</year>). <article-title>A logical calculus of the ideas immanent in nervous activity</article-title>. <source>Bulletin of Mathematical Biophysics</source>, <volume>5</volume>, <fpage>115</fpage>&#x2013;<lpage>133</lpage>. <comment>Reprinted in the <italic>Bulletin of Mathematical Biology</italic>, Vol.52, No.1-2, pp.99-115, 1990</comment>. <comment>1340, 1343</comment></mixed-citation></ref>
<ref id="ref-431"><label>431.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Kline</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2015</year>). <source>The Cybernetic Moment, or why we call our Age the Information Age</source>. <publisher-loc>Baltimore</publisher-loc>: <publisher-name>Johns Hopkins University Press</publisher-name>. <comment>1340, 1341, 1342, 1343</comment></mixed-citation></ref>
<ref id="ref-432"><label>432.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Cariani</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2017</year>). <article-title>The Cybernetics Moment: or Why We Call Our Age the Information Age</article-title>. <source>Cognitive Systems Research</source>, <volume>43</volume>, <fpage>119</fpage>&#x2013;<lpage>124</lpage>. <comment>1340, 1341</comment></mixed-citation></ref>
<ref id="ref-433"><label>433.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Eisenhart</surname>, <given-names>C.</given-names></string-name></person-group> (<year>1949</year>). <article-title>Cybernetics - A new discipline</article-title>. <source>Science</source>, <volume>109</volume>(<issue>2834</issue>), <fpage>397</fpage>&#x2013;<lpage>399</lpage>. <comment>1340</comment>; <pub-id pub-id-type="pmid">17749955</pub-id></mixed-citation></ref>
<ref id="ref-434"><label>434.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W. E. H.</given-names></string-name></person-group> (<year>1949</year>). <article-title>Book Review: Cybernetics: Or Control and Communication in the Animal and the Machine</article-title>. <source>Quarterly Journal of Experimental Psychology</source>, <volume>1</volume>(<issue>4</issue>), <fpage>193</fpage>&#x2013;<lpage>194</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1080/17470214908416765">https://doi.org/10.1080/17470214908416765</ext-link>. <comment>1341</comment></mixed-citation></ref>
</ref-list>
<app-group> 
<app id="app1">
<label>Appendices</label>
<sec id="s15">
<title>1 Backprop pseudocodes, notation comparison</title>
<p>To connect the backpropagation Algorithm <xref ref-type="fig" rid="fig-159">1</xref> in Section <xref ref-type="sec" rid="s5">5</xref> to Algorithm 6.4 in [<xref ref-type="bibr" rid="ref-78">78</xref>], p.206, Section 6.5.4 on &#x201C;Back-Propagation Computation in Fully Connected MLP&#x201D;,<xref ref-type="fn" rid="fn338"><sup>338</sup></xref><fn id="fn338"><label>338</label><p>MLP = MultiLayer Perceptron.</p></fn> Algorithm <xref ref-type="fig" rid="fig-167">9</xref> restates Algorithm <xref ref-type="fig" rid="fig-159">1</xref> with a &#x201C;for&#x201D; loop in place of the &#x201C;while&#x201D; loop. This alternative form should be especially useful for first-time learners. See also Remark <xref ref-type="statement" rid="st5_4">5.4</xref>.</p>
<fig id="fig-167">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-167.tif"/>
</fig>
<p>In Algorithm <xref ref-type="fig" rid="fig-167">9</xref>, the regularization of the cost function <inline-formula id="ieqn-2867"><mml:math id="mml-ieqn-2867"><mml:mi>J</mml:mi></mml:math></inline-formula> is not considered, i.e., we omit the penalty (or regularization) term <inline-formula id="ieqn-2868"><mml:math id="mml-ieqn-2868"><mml:mi>&#x03BB;</mml:mi><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> used in Algorithm 6.4 in [<xref ref-type="bibr" rid="ref-78">78</xref>], p.206, where the regularized cost is <inline-formula id="ieqn-2869"><mml:math id="mml-ieqn-2869"><mml:mi>J</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. As pointed out in Section <xref ref-type="sec" rid="s6_3_6">6.3.6</xref>, weight decay is more general than <inline-formula id="ieqn-2870"><mml:math id="mml-ieqn-2870"><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> regularization, and would be the preferred method to avoid overfitting.</p>
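<p>As an illustrative aid, the unregularized backward &#x201C;for&#x201D; loop can be sketched in Python with NumPy. The sketch is not a transcription of Algorithm <xref ref-type="fig" rid="fig-167">9</xref> or of Algorithm 6.4 in [<xref ref-type="bibr" rid="ref-78">78</xref>]. It assumes a tanh activation in every layer, a squared-error cost, separate bias vectors, and hypothetical function names (forward, cost, backprop_row, backprop_col), none of which are taken from the text. The gradients are computed once with row gradients and once with column gradients, so that the two conventions can be compared numerically, and one weight gradient is checked against a central finite difference.</p>
<code language="python">
```python
import numpy as np


def forward(x, weights, biases):
    """Forward pass: return weighted inputs z^(l) and outputs y^(l), l = 0..L."""
    ys = [x]      # y^(0) is the network input, a column of shape (m_0, 1)
    zs = [None]   # placeholder so that zs[l] belongs to layer l
    for W, b in zip(weights, biases):
        zs.append(W @ ys[-1] + b)   # weighted inputs z^(l)
        ys.append(np.tanh(zs[-1]))  # outputs y^(l) = a(z^(l)), assuming a = tanh
    return zs, ys


def cost(x, target, weights, biases):
    """Unregularized cost J = 0.5 * ||y^(L) - target||^2 (assumed for this sketch)."""
    _, ys = forward(x, weights, biases)
    return 0.5 * float(np.sum((ys[-1] - target) ** 2))


def backprop_row(x, target, weights, biases):
    """Backward "for" loop with row gradients r^(l) of shape (1, m_l)."""
    zs, ys = forward(x, weights, biases)
    n_layers = len(weights)
    grad_W = [None] * n_layers
    grad_b = [None] * n_layers
    dJ_dy = (ys[-1] - target).T                      # row, shape (1, m_L)
    for l in range(n_layers, 0, -1):                 # l = L, L-1, ..., 1
        r = dJ_dy * (1.0 - np.tanh(zs[l]) ** 2).T    # elementwise product with a'(z^(l)T)
        grad_b[l - 1] = r                            # dJ/db^(l), kept as a row
        grad_W[l - 1] = r.T @ ys[l - 1].T            # stored with the shape of W^(l)
        dJ_dy = r @ weights[l - 1]                   # dJ/dy^(l-1), row of shape (1, m_(l-1))
    return grad_W, grad_b


def backprop_col(x, target, weights, biases):
    """Same loop with column gradients g of shape (m_k, 1)."""
    zs, ys = forward(x, weights, biases)
    n_layers = len(weights)
    grad_W = [None] * n_layers
    grad_b = [None] * n_layers
    g = ys[-1] - target                              # column, shape (m_L, 1)
    for k in range(n_layers, 0, -1):                 # k = L, L-1, ..., 1
        g = g * (1.0 - np.tanh(zs[k]) ** 2)          # gradient on pre-nonlinear activation
        grad_b[k - 1] = g
        grad_W[k - 1] = g @ ys[k - 1].T
        g = weights[k - 1].T @ g                     # gradient on output of layer k-1
    return grad_W, grad_b


# Small fixed example with layer widths m_0 = 3, m_1 = 4, m_2 = 2.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m_out, m_in))
           for m_in, m_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m_out, 1)) for m_out in sizes[1:]]
x = rng.standard_normal((sizes[0], 1))
target = rng.standard_normal((sizes[-1], 1))

gW_row, gb_row = backprop_row(x, target, weights, biases)
gW_col, gb_col = backprop_col(x, target, weights, biases)

# Central finite-difference estimate of dJ/dW^(1)[0, 0] for comparison.
eps = 1e-6
w_plus = [W.copy() for W in weights]
w_minus = [W.copy() for W in weights]
w_plus[0][0, 0] += eps
w_minus[0][0, 0] -= eps
numeric = (cost(x, target, w_plus, biases)
           - cost(x, target, w_minus, biases)) / (2 * eps)
```
</code>
<p>Under these assumptions, the weight gradients returned by the two loops should coincide, the bias gradients should be transposes of one another, and the finite-difference value numeric should agree with gW_row[0][0, 0] up to discretization error.</p>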
<p>In Table 8, the correspondence between the notations employed here and those in [<xref ref-type="bibr" rid="ref-78">78</xref>], p.206, is provided.</p>
<table-wrap id="table-8"><label>Table 8</label>
<caption>
<p>Equivalence of backprop Algorithm <xref ref-type="fig" rid="fig-167">9</xref> and Algorithm 6.4 in [<xref ref-type="bibr" rid="ref-78">78</xref>], p.206. Comparison of notations. The mathematical expressions in Algorithm 6.4 are reproduced here in their original notations, except for the matrix dimensions, which were not given in Algorithm 6.4 of [<xref ref-type="bibr" rid="ref-78">78</xref>].</p></caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th align="left">Algorithm <xref ref-type="fig" rid="fig-167">9</xref>, current notation</th>
<th align="left">Goodfellow Algorithm 6.4, original notation</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">&#x25CF; Layer index = &#x2113;</td>
<td align="left">&#x25CF; Layer index = <italic>k</italic></td>
</tr>
<tr>
<td align="left">&#x25CF; Gradient = <inline-formula id="ieqn-2876"><mml:math id="mml-ieqn-2876"><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> (row)</td>
<td align="left">&#x25CF; Gradient = <inline-formula id="ieqn-2878"><mml:math id="mml-ieqn-2878"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> (column)</td>
</tr>
<tr>
<td align="left">&#x25CF; Output of layer <inline-formula id="ieqn-2880"><mml:math id="mml-ieqn-2880"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> = <inline-formula id="ieqn-2881"><mml:math id="mml-ieqn-2881"><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> (column)</td>
<td align="left">&#x25CF; Output of layer <inline-formula id="ieqn-2883"><mml:math id="mml-ieqn-2883"><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> = <inline-formula id="ieqn-2884"><mml:math id="mml-ieqn-2884"><mml:msup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> (column)</td>
</tr>
<tr>
<td align="left">&#x25CF; Predicted output (last layer <inline-formula id="ieqn-2886"><mml:math id="mml-ieqn-2886"><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>): <inline-formula id="ieqn-2887"><mml:math id="mml-ieqn-2887"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td align="left">&#x25CF; Predicted output (last layer <inline-formula id="ieqn-2889"><mml:math id="mml-ieqn-2889"><mml:mo stretchy="false">(</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>): <inline-formula id="ieqn-2890"><mml:math id="mml-ieqn-2890"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">&#x25CF; Weighted inputs to layer <inline-formula id="ieqn-2892"><mml:math id="mml-ieqn-2892"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>: <inline-formula id="ieqn-2893"><mml:math id="mml-ieqn-2893"><mml:msup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td align="left">&#x25CF; Pre-nonlinear activation for layer <inline-formula id="ieqn-2895"><mml:math id="mml-ieqn-2895"><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>: <inline-formula id="ieqn-2896"><mml:math id="mml-ieqn-2896"><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">&#x25CF; Activation function: <inline-formula id="ieqn-2898"><mml:math id="mml-ieqn-2898"><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula></td>
<td align="left">&#x25CF; Activation function: <inline-formula id="ieqn-2900"><mml:math id="mml-ieqn-2900"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">&#x25CF; Gradient on output of layer <inline-formula id="ieqn-2902"><mml:math id="mml-ieqn-2902"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</td>
<td align="left">&#x25CF; Gradient on output of layer <inline-formula id="ieqn-2904"><mml:math id="mml-ieqn-2904"><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-2905"><mml:math id="mml-ieqn-2905"><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> (row)</td>
<td align="left"><inline-formula id="ieqn-2906"><mml:math id="mml-ieqn-2906"><mml:mrow><mml:msub><mml:mi>&#x2207;</mml:mi><mml:mrow><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> (column)</td>
</tr>
<tr>
<td align="left">&#x25CF; Gradient on weighted inputs:</td>
<td align="left">&#x25CF; Gradient on pre-nonlinearity activation:</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-2909"><mml:math id="mml-ieqn-2909"><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>&#x2299;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext>row</mml:mtext><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-2910"><mml:math id="mml-ieqn-2910"><mml:mrow><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo>&#x2299;</mml:mo><mml:mi>f</mml:mi><mml:mo>&#x2032;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext>col</mml:mtext><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">Code line <xref ref-type="fig" rid="fig-167">6</xref> in Algorithm <xref ref-type="fig" rid="fig-167">9</xref></td>
<td align="left">Code in Algorithm 6.4, [<xref ref-type="bibr" rid="ref-78">78</xref>]</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-2911"><mml:math id="mml-ieqn-2911"><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2190;</mml:mo><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi><mml:mo>/</mml:mo><mml:mo>&#x2202;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mo>&#x2299;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x0027;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> (row)</td>
<td align="left"><inline-formula id="ieqn-2912"><mml:math id="mml-ieqn-2912"><mml:mrow><mml:mi>g</mml:mi><mml:mo>&#x2190;</mml:mo><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mrow><mml:msup><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mo>&#x2299;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mo>&#x0027;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi mathvariant="double-struck">R</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> (column)</td>
</tr>
<tr>
<td align="left">&#x25CF; Layer parameters</td>
<td align="left">&#x25CF; Layer parameters</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-2915"><mml:math id="mml-ieqn-2915"><mml:msup><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-2916"><mml:math id="mml-ieqn-2916"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left">&#x25CF; Gradient on layer parameters <inline-formula id="ieqn-2918"><mml:math id="mml-ieqn-2918"><mml:msup><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td align="left">&#x25CF; Gradient on layer parameters <inline-formula id="ieqn-2920"><mml:math id="mml-ieqn-2920"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-2921"><mml:math id="mml-ieqn-2921"><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo 
stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-2922"><mml:math id="mml-ieqn-2922"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mi>J</mml:mi></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-2923"><mml:math id="mml-ieqn-2923"><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-2924"><mml:math id="mml-ieqn-2924"><mml:mrow><mml:msub><mml:mi>&#x2207;</mml:mi><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mtext>&#x2009;</mml:mtext><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mtext>&#x2009;</mml:mtext><mml:msub><mml:mi>&#x2207;</mml:mi><mml:mrow><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-2926"><mml:math id="mml-ieqn-2926"><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-2927"><mml:math id="mml-ieqn-2927"><mml:mo>&#x2202;</mml:mo><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">b</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td align="left">(no regularization in cost function, <inline-formula id="ieqn-2928"><mml:math id="mml-ieqn-2928"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula>)</td>
</tr>
<tr>
<td align="left">&#x25CF; Gradient on layer outputs <inline-formula id="ieqn-2930"><mml:math id="mml-ieqn-2930"><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td align="left">&#x25CF; Gradient on layer outputs <inline-formula id="ieqn-2932"><mml:math id="mml-ieqn-2932"><mml:msup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-2933"><mml:math id="mml-ieqn-2933"><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mo stretchy="false">&#x2190;</mml:mo><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mi>J</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">r</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x2113;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> (row)</td>
<td align="left"><inline-formula id="ieqn-2934"><mml:math id="mml-ieqn-2934"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo stretchy="false">&#x2190;</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:msup><mml:mi mathvariant="bold-italic">h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> (column)</td> </tr>
</tbody>
</table>
</table-wrap>
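<p>The two columns of the table above differ only in whether gradients are stored as rows or as columns. The following minimal Python sketch illustrates this for a single layer; it is not taken from either source. The helper names (<monospace>matmul</monospace>, <monospace>transpose</monospace>, <monospace>hadamard</monospace>, <monospace>dtanh</monospace>, <monospace>backward_row</monospace>, <monospace>backward_col</monospace>), the choice of tanh as activation function, the zero bias, and the numerical data are all assumptions made for illustration only.</p>
<code language="python">
```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    """Element-wise product of two equally shaped matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def dtanh(A):
    """Element-wise derivative of tanh (assumed stand-in for a' or f')."""
    return [[1.0 - math.tanh(a) ** 2 for a in row] for row in A]

def backward_row(r, z, y_prev, W):
    """Row convention (left column): r is 1 x m(l); z and y_prev are columns."""
    r = hadamard(r, dtanh(transpose(z)))              # r = dJ/dz, 1 x m(l)
    grad_W = matmul(transpose(r), transpose(y_prev))  # r^T y_prev^T
    grad_b = transpose(r)                             # r^T
    r = matmul(r, W)                                  # dJ/dy_prev, 1 x m(l-1)
    return r, grad_W, grad_b

def backward_col(g, a, h_prev, W):
    """Column convention (right column): g is m(k) x 1; a and h_prev are columns."""
    g = hadamard(g, dtanh(a))                         # g = gradient w.r.t. a
    grad_W = matmul(g, transpose(h_prev))             # g h_prev^T
    grad_b = g                                        # gradient w.r.t. b is g
    g = matmul(transpose(W), g)                       # W^T g, m(k-1) x 1
    return g, grad_W, grad_b

# Hypothetical data: a layer with 3 inputs and 2 outputs, zero bias.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
y_prev = [[0.2], [0.1], [-0.3]]          # column, m(l-1) x 1
z = matmul(W, y_prev)                    # column, m(l) x 1
grad_y = [[0.2, -0.4]]                   # dJ/dy as a row, 1 x m(l)

r_prev, grad_W_row, grad_b_row = backward_row(grad_y, z, y_prev, W)
g_prev, grad_W_col, grad_b_col = backward_col(transpose(grad_y), z, y_prev, W)
```
</code>
<p>Under these assumptions, <monospace>g_prev</monospace> is the transpose of <monospace>r_prev</monospace>, and both conventions yield the same gradients on the weights and on the biases, which is the content of the relation between the two gradient vectors stated in the last parameter row of the table.</p>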
</sec>
<sec id="s16">
<title>2 Another LSTM block diagram</title>
<p>An alternative block diagram for the folded RNN with LSTM cell corresponding to Figure <xref ref-type="fig" rid="fig-81">81</xref> is shown in Figure <xref ref-type="fig" rid="fig-152">152</xref> below:</p>
<fig id="fig-152">
<label>Figure 152</label>
<caption><title><italic>Folded RNN and LSTM cell, two feedback loops with delay, block diagram</italic>. Typical state <inline-formula id="ieqn-2935"><mml:math id="mml-ieqn-2935"><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. Corrections for the figures in (1) Figure 10.16 in the online book <ext-link ext-link-type="uri" xlink:href="https://www.deeplearningbook.org">Deep Learning</ext-link> by Goodfellow et al. (2016), <ext-link ext-link-type="uri" xlink:href="https://www.deeplearningbook.org/contents/rnn.html">Chap.10</ext-link>, p.405 (referred to here as &#x201C;DL-A&#x201D;, or &#x201C;Deep Learning, version A&#x201D;), and (2) [<xref ref-type="bibr" rid="ref-78">78</xref>], p.398 (referred to as &#x201C;DL-B&#x201D;). (The above figure is adapted from a figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-152.tif"/>
</fig>
<p>Figure 10.16 in the online book <ext-link ext-link-type="uri" xlink:href="https://www.deeplearningbook.org">Deep Learning</ext-link> by Goodfellow et al. (2016), <ext-link ext-link-type="uri" xlink:href="https://www.deeplearningbook.org/contents/rnn.html">Chap.10</ext-link>, p.405 (referred to here as &#x201C;DL-A&#x201D;, or &#x201C;Deep Learning, version A&#x201D;), was incomplete, with important details missing. Even the <italic>updated</italic> Figure 10.16 in [<xref ref-type="bibr" rid="ref-78">78</xref>], p.398 (referred to as &#x201C;DL-B&#x201D;), was still incomplete (or incorrect).</p>
<p>The corrected arrows, added annotations, and colors correspond to those in the equivalent Figure <xref ref-type="fig" rid="fig-81">81</xref>. The corrections are described below.</p>
<p><bold><italic>Error 1:</italic></bold> The cell state <inline-formula id="ieqn-2936"><mml:math id="mml-ieqn-2936"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>c</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> should be squashed by the state sigmoidal activation function <inline-formula id="ieqn-2937"><mml:math id="mml-ieqn-2937"><mml:msub><mml:mi>&#x1D49C;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> with range <inline-formula id="ieqn-2938"><mml:math id="mml-ieqn-2938"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (brown dot, e.g., tanh function) before being multiplied by the scaling factor <inline-formula id="ieqn-2939"><mml:math id="mml-ieqn-2939"><mml:msub><mml:mi>&#x02131;</mml:mi><mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> coming from the output gate to produce the hidden state <inline-formula id="ieqn-2940"><mml:math id="mml-ieqn-2940"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. This correction is for both DL-A and DL-B.</p>
<p><bold><italic>Error 2:</italic></bold> The hidden-state feedback loop (green) should start from the hidden state <inline-formula id="ieqn-2941"><mml:math id="mml-ieqn-2941"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, delayed by one step, i.e., <inline-formula id="ieqn-2942"><mml:math id="mml-ieqn-2942"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, which is fed to all four gates: (1) The <italic>externally-input</italic> gate <inline-formula id="ieqn-2943"><mml:math id="mml-ieqn-2943"><mml:mi>g</mml:mi></mml:math></inline-formula> (purple box) with activation function <inline-formula id="ieqn-2944"><mml:math id="mml-ieqn-2944"><mml:msub><mml:mi>&#x1D49C;</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula> having the range <inline-formula id="ieqn-2945"><mml:math id="mml-ieqn-2945"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, (2) the <italic>input</italic> gate <inline-formula id="ieqn-2946"><mml:math id="mml-ieqn-2946"><mml:mi>&#x02110;</mml:mi></mml:math></inline-formula>, (3) the <italic>forget</italic> gate <inline-formula id="ieqn-2947"><mml:math id="mml-ieqn-2947"><mml:mi>f</mml:mi></mml:math></inline-formula>, (4) the <italic>output</italic> gate <inline-formula id="ieqn-2948"><mml:math id="mml-ieqn-2948"><mml:mi>&#x1D4AA;</mml:mi></mml:math></inline-formula>. 
The activations <inline-formula id="ieqn-2949"><mml:math id="mml-ieqn-2949"><mml:msub><mml:mi>&#x1D49C;</mml:mi><mml:mi>&#x03B1;</mml:mi></mml:msub></mml:math></inline-formula>, with <inline-formula id="ieqn-2950"><mml:math id="mml-ieqn-2950"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mi>&#x02110;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> (three blue boxes), all have the interval <inline-formula id="ieqn-2951"><mml:math id="mml-ieqn-2951"><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as their range. This hidden-state feedback loop was missing in DL-A, whereas the hidden-state feedback loop in DL-B incorrectly started from the summation operation (grey circle, just below <inline-formula id="ieqn-2952"><mml:math id="mml-ieqn-2952"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>c</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>) in the cell-state feedback loop, and did not feed into the externally-input gate <inline-formula id="ieqn-2953"><mml:math id="mml-ieqn-2953"><mml:mi>g</mml:mi></mml:math></inline-formula>.</p>
<p><bold><italic>Error 3:</italic></bold> Four pairs of arrows pointing into the four gates <inline-formula id="ieqn-2954"><mml:math id="mml-ieqn-2954"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>g</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mo>&#x02110;</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, with one pair per gate, were intended as inputs to these gates, but were not annotated, and were thus unclear and confusing. Here, for each gate, one arrow is used for the hidden state <inline-formula id="ieqn-2955"><mml:math id="mml-ieqn-2955"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>h</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, and the other arrow for the input <inline-formula id="ieqn-2956"><mml:math id="mml-ieqn-2956"><mml:msup><mml:mrow><mml:mi mathvariant='bold-italic'>x</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. This correction is for both DL-A and DL-B.</p></sec>
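<p>The data flow implied by the three corrections above can be sketched as one step of an LSTM cell in plain Python. This is only an illustration under stated assumptions, not code from DL-A, DL-B, or the present paper: the function names, the parameter layout (a dictionary mapping each gate name to hypothetical arrays <monospace>W</monospace>, <monospace>U</monospace>, <monospace>b</monospace>), the use of tanh and the logistic sigmoid as the activations with ranges (&#x2212;1, +1) and (0, 1), and the standard cell-state update are all assumptions.</p>
<code language="python">
```python
import math

def sigmoid(v):
    """Logistic sigmoid with range (0, 1), assumed for the three blue gates."""
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, U, b, x, h_prev):
    """Return W x + U h_prev + b for list-based matrices and vectors."""
    return [
        sum(w * xj for w, xj in zip(W_row, x))
        + sum(u * hj for u, hj in zip(U_row, h_prev))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One step [k] of an LSTM cell; params maps a gate name to (W, U, b)."""
    # Errors 2 and 3: every gate receives the same two inputs, x[k] and h[k-1].
    pre = {name: affine(W, U, b, x, h_prev) for name, (W, U, b) in params.items()}
    g = [math.tanh(v) for v in pre["g"]]   # externally-input gate, range (-1, +1)
    i = [sigmoid(v) for v in pre["i"]]     # input gate, range (0, 1)
    f = [sigmoid(v) for v in pre["f"]]     # forget gate, range (0, 1)
    o = [sigmoid(v) for v in pre["o"]]     # output gate, range (0, 1)
    # Cell-state feedback loop (standard form): c[k] = f * c[k-1] + i * g.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    # Error 1: squash c[k] with tanh before scaling by the output gate.
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

# Hypothetical parameters: one input component, two hidden units,
# the same (W, U, b) reused for all four gates to keep the example short.
params = {
    name: ([[0.1], [-0.2]], [[0.3, 0.0], [0.0, -0.1]], [0.0, 0.0])
    for name in ("g", "i", "f", "o")
}
h, c = lstm_step(x=[1.0], h_prev=[0.0, 0.0], c_prev=[0.5, -0.5], params=params)
```
</code>
<p>In this sketch, the hidden state returned for step [k] is the quantity that would be fed back, delayed by one step, as <monospace>h_prev</monospace> in the next call, while <monospace>c</monospace> is fed back as <monospace>c_prev</monospace>, which mirrors the two feedback loops with delay in the block diagram.</p>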
<sec id="s17">  
<title>3 Conditional Gaussian distribution</title>
<p>The derivation of Eqs. (<xref ref-type="disp-formula" rid="eqn-377">377</xref>)-(<xref ref-type="disp-formula" rid="eqn-379">379</xref>) provided here helps develop a better feel for the Gaussian distribution and facilitates the understanding of the conditional Gaussian process posterior described in Section <xref ref-type="sec" rid="s8_3_2">8.3.2</xref>.</p>
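<p>For orientation, the simplest instance of such conditioning, with one observed scalar and one test scalar, can be sketched in plain Python using the standard textbook formulas for a bivariate Gaussian. The function name, the argument names, and the numerical values (unit prior variances, cross-covariance 0.8, noise standard deviation 0.1 added to the observed variance) are assumptions chosen only for illustration.</p>
<code language="python">
```python
def condition_gaussian_2d(mu_y, mu_t, c_yy, c_yt, c_tt, y_obs):
    """Condition a bivariate Gaussian on an observed value of y.

    Joint: [y, t] ~ N([mu_y, mu_t], [[c_yy, c_yt], [c_yt, c_tt]]), where t
    plays the role of the test value. Returns the mean and variance of t
    given y = y_obs (standard textbook result for jointly Gaussian variables).
    """
    gain = c_yt / c_yy                     # scalar case of C_ty times inverse of C_yy
    mean = mu_t + gain * (y_obs - mu_y)    # conditional mean
    var = c_tt - gain * c_yt               # conditional variance (Schur complement)
    return mean, var

# Hypothetical numbers: prior variance 1.0 plus noise variance 0.1 ** 2
# on the observed value, and cross-covariance 0.8 with the test value.
mean, var = condition_gaussian_2d(
    mu_y=0.0, mu_t=0.0, c_yy=1.0 + 0.1 ** 2, c_yt=0.8, c_tt=1.0, y_obs=1.5
)
```
</code>
<p>The conditional variance returned here does not depend on the observed value and is never larger than the prior variance of the test value; with zero cross-covariance, the observation leaves the mean and variance of the test value unchanged.</p>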
<p>If two sets of variables have a joint Gaussian distribution, i.e., these two sets are jointly Gaussian, then the conditional probability distribution of one set given the other set is also Gaussian.<xref ref-type="fn" rid="fn339"><sup>339</sup></xref><fn id="fn339"><label>339</label><p>See, e.g., [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 85.</p></fn> The two sets of variables considered here are the observed values in <inline-formula id="ieqn-2957"><mml:math id="mml-ieqn-2957"><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:math></inline-formula> and the test values in <inline-formula id="ieqn-2958"><mml:math id="mml-ieqn-2958"><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-376">376</xref>). Define</p>
<p><disp-formula id="eqn-518"><label>(518)</label><mml:math id="mml-eqn-518" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>:=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>}</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>:=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>}</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>:=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" 
columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BD;</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo 
stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03BD;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:msup><mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p><disp-formula id="eqn-519"><label>(519)</label><mml:math id="mml-eqn-519" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>:=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BD;</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" 
/><mml:mrow><mml:mtext>&#xA0;with&#xA0;</mml:mtext></mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable><mml:mo>,</mml:mo></mml:math></disp-formula> then expand the exponent in the Gaussian joint probability Eq. (<xref ref-type="disp-formula" rid="eqn-368">368</xref>), with <inline-formula id="ieqn-2959"><mml:math id="mml-ieqn-2959"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> instead of <inline-formula id="ieqn-2960"><mml:math id="mml-ieqn-2960"><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, to have the Mahalanobis distance<xref ref-type="fn" rid="fn340"><sup>340</sup></xref><fn id="fn340"><label>340</label><p>See, e.g., [<xref ref-type="bibr" 
rid="ref-130">130</xref>], p. 80.</p></fn> <inline-formula id="ieqn-2961"><mml:math id="mml-ieqn-2961"><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:math></inline-formula> squared, a quadratic form in terms of <inline-formula id="ieqn-2962"><mml:math id="mml-ieqn-2962"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, written as:</p>
<p><disp-formula id="eqn-520"><label>(520)</label><mml:math id="mml-eqn-520" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>:=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mo 
stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mo 
stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which, with the training values held fixed, is also a quadratic function of <inline-formula id="ieqn-2963"><mml:math id="mml-ieqn-2963"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, by the symmetry of <inline-formula id="ieqn-2964"><mml:math id="mml-ieqn-2964"><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:math></inline-formula>, the inverse of the covariance matrix, in Eq. (<xref ref-type="disp-formula" rid="eqn-519">519</xref>), implying that the conditional distribution of the test values <inline-formula id="ieqn-2965"><mml:math id="mml-ieqn-2965"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> is Gaussian. 
The covariance matrix and the mean of this Gaussian distribution over <inline-formula id="ieqn-2966"><mml:math id="mml-ieqn-2966"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> are determined by matching the terms quadratic and linear in <inline-formula id="ieqn-2967"><mml:math id="mml-ieqn-2967"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> with those in the expanded exponent of the general Gaussian distribution <inline-formula id="ieqn-2968"><mml:math id="mml-ieqn-2968"><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mtext>&#x2009;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> over the variable <inline-formula id="ieqn-2969"><mml:math id="mml-ieqn-2969"><mml:mrow><mml:mi mathvariant='bold-italic'>z</mml:mi></mml:mrow></mml:math></inline-formula> with mean <inline-formula id="ieqn-2970"><mml:math id="mml-ieqn-2970"><mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> and covariance matrix <inline-formula id="ieqn-2971"><mml:math id="mml-ieqn-2971"><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-521"><label>(521)</label><mml:math id="mml-eqn-521" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>&#xA0;constant</mml:mtext></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the constant is independent of <inline-formula id="ieqn-2972"><mml:math id="mml-ieqn-2972"><mml:mrow><mml:mi mathvariant='bold-italic'>z</mml:mi></mml:mrow></mml:math></inline-formula>. Expand Eq. (<xref ref-type="disp-formula" rid="eqn-520">520</xref>) to obtain</p>
<p><disp-formula id="eqn-522"><label>(522)</label><mml:math id="mml-eqn-522" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>&#xA0;constant</mml:mtext></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>and compare to Eq. (<xref ref-type="disp-formula" rid="eqn-521">521</xref>), then for the conditional distribution <inline-formula id="ieqn-2973"><mml:math id="mml-ieqn-2973"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of the test values <inline-formula id="ieqn-2974"><mml:math id="mml-ieqn-2974"><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x02DC;</mml:mo></mml:mover></mml:math></inline-formula> at <inline-formula id="ieqn-2975"><mml:math id="mml-ieqn-2975"><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> given the data <inline-formula id="ieqn-2976"><mml:math id="mml-ieqn-2976"><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the covariance matrix <inline-formula id="ieqn-2977"><mml:math id="mml-ieqn-2977"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> and the mean <inline-formula id="ieqn-2978"><mml:math id="mml-ieqn-2978"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are</p>
<p><disp-formula id="eqn-523"><label>(523)</label><mml:math id="mml-eqn-523" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-524"><label>(524)</label><mml:math id="mml-eqn-524" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi 
mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi 
mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-525"><label>(525)</label><mml:math id="mml-eqn-525" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>in which Eq. (<xref ref-type="disp-formula" rid="eqn-523">523</xref>)<inline-formula id="ieqn-2979"><mml:math id="mml-ieqn-2979"><mml:msub><mml:mi></mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> has been used.</p>
<p>At this point, the submatrices <inline-formula id="ieqn-2980"><mml:math id="mml-ieqn-2980"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-2981"><mml:math id="mml-ieqn-2981"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> can be expressed in terms of the submatrices of the partitioned matrix <inline-formula id="ieqn-2982"><mml:math id="mml-ieqn-2982"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> in Eq. (<xref ref-type="disp-formula" rid="eqn-518">518</xref>) as follows. 
From the definition of the matrix <inline-formula id="ieqn-2983"><mml:math id="mml-ieqn-2983"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> as the inverse of the covariance matrix <inline-formula id="ieqn-2984"><mml:math id="mml-ieqn-2984"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>,</p>
<p><disp-formula id="eqn-526"><label>(526)</label><mml:math id="mml-eqn-526" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi 
mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>the 2nd row gives rise to a system of two equations for the two unknowns <inline-formula id="ieqn-2985"><mml:math id="mml-ieqn-2985"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-2986"><mml:math id="mml-ieqn-2986"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-527"><label>(527)</label><mml:math id="mml-eqn-527" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi 
mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>in which the covariance matrix <inline-formula id="ieqn-2987"><mml:math id="mml-ieqn-2987"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> is symmetric, so that Eq. (<xref ref-type="disp-formula" rid="eqn-527">527</xref>) is a particular case of the more general, non-symmetric problem of expressing <inline-formula id="ieqn-2988"><mml:math id="mml-ieqn-2988"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in terms of <inline-formula id="ieqn-2989"><mml:math id="mml-ieqn-2989"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-528"><label>(528)</label><mml:math id="mml-eqn-528" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo 
stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>in which the implications follow from the first of the two equations; the second equation then leads to</p>
<p><disp-formula id="eqn-529"><label>(529)</label><mml:math id="mml-eqn-529" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-530"><label>(530)</label><mml:math id="mml-eqn-530" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">&#x21D2;</mml:mo><mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" 
/><mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" /><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>which, after using Eq. (<xref ref-type="disp-formula" rid="eqn-527">527</xref>) to identify <inline-formula id="ieqn-2990"><mml:math id="mml-ieqn-2990"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-2991"><mml:math id="mml-ieqn-2991"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi 
mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mrow></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-2992"><mml:math id="mml-ieqn-2992"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, and then substituting into Eq. (<xref ref-type="disp-formula" rid="eqn-523">523</xref>) for the conditional covariance <inline-formula id="ieqn-2993"><mml:math id="mml-ieqn-2993"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula> and Eq. 
(<xref ref-type="disp-formula" rid="eqn-525">525</xref>) for the conditional mean <inline-formula id="ieqn-2994"><mml:math id="mml-ieqn-2994"><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub><mml:mrow></mml:mrow></mml:math></inline-formula>, yields Eq. (<xref ref-type="disp-formula" rid="eqn-379">379</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-378">378</xref>), respectively.</p> 
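<p>As an illustrative numerical check, which is not part of the original derivation, the following minimal Python/NumPy sketch builds a random symmetric positive-definite matrix, partitions it as in Eq. (<xref ref-type="disp-formula" rid="eqn-526">526</xref>), and compares the blocks of its inverse with the last relation in Eq. (<xref ref-type="disp-formula" rid="eqn-528">528</xref>) and the last relation in Eq. (<xref ref-type="disp-formula" rid="eqn-530">530</xref>). It also compares the conditional mean written as in Eq. (<xref ref-type="disp-formula" rid="eqn-525">525</xref>) with the same mean written in terms of the covariance blocks. The block sizes, the random seed, and all variable names (<monospace>P</monospace>, <monospace>Q</monospace>, <monospace>R</monospace>, <monospace>S</monospace>, <monospace>F</monospace>, <monospace>G</monospace>, <monospace>schur</monospace>, <monospace>gain</monospace>) are assumptions made only for this sketch.</p>
<code language="python">
```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2  # hypothetical sizes of y and y-tilde (assumed for illustration)

# Random symmetric positive-definite joint covariance, partitioned as in Eq. (526).
A = rng.standard_normal((n + m, n + m))
C = A @ A.T + (n + m) * np.eye(n + m)
P, Q = C[:n, :n], C[:n, n:]  # blocks C_yy and C_y,y-tilde
R, S = C[n:, :n], C[n:, n:]  # blocks C_y-tilde,y and C_y-tilde,y-tilde

# Second block row of the inverse, i.e. the unknowns (F, G) of Eq. (527).
D = np.linalg.inv(C)
F, G = D[n:, :n], D[n:, n:]
G_inv = np.linalg.inv(G)

# Last relation in Eq. (530): inverse of G equals S - R P^{-1} Q.
schur = S - R @ np.linalg.inv(P) @ Q
check_cov = np.allclose(G_inv, schur)

# Last relation in Eq. (528): G^{-1} F equals -R P^{-1}.
gain = R @ np.linalg.inv(P)
check_gain = np.allclose(-G_inv @ F, gain)

# Conditional mean: Eq. (525) in terms of D blocks versus covariance blocks.
mu, mu_tilde = rng.standard_normal(n), rng.standard_normal(m)
y = rng.standard_normal(n)
mean_from_D = mu_tilde - G_inv @ F @ (y - mu)
mean_from_C = mu_tilde + gain @ (y - mu)
check_mean = np.allclose(mean_from_D, mean_from_C)

print(check_cov, check_gain, check_mean)
```
</code>
<p>The sketch deliberately uses only the two products that do not involve the inverse of <monospace>R</monospace>: the intermediate expressions for <monospace>F</monospace> and <monospace>G</monospace> in Eq. (<xref ref-type="disp-formula" rid="eqn-530">530</xref>) presuppose a square, invertible <monospace>R</monospace>, whereas the two relations checked here remain meaningful when the two blocks of variables have different sizes.</p>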
<statement id="st3_1"><title><xref ref-type="statement" rid="st3_1">Remark 3.1</xref>.</title>
<p>Another, indirect way to obtain Eq. (<xref ref-type="disp-formula" rid="eqn-379">379</xref>) and Eq. (<xref ref-type="disp-formula" rid="eqn-378">378</xref>), without derivation, is to use the identity for the inverse of a partitioned matrix in Eq. (<xref ref-type="disp-formula" rid="eqn-531">531</xref>), as done in [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 87:</p>
<p><disp-formula id="eqn-531"><label>(531)</label><mml:math id="mml-eqn-531" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="left left" rowspacing="4pt" 
columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mspace width="1em" 
/><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>:=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>This method is less satisfactory, since without a derivation there is no insight into where the matrix elements in Eq. (<xref ref-type="disp-formula" rid="eqn-531">531</xref>) come from. In fact, the derivation of the 1st row of Eq. (<xref ref-type="disp-formula" rid="eqn-531">531</xref>) follows exactly the same line as for Eqs. (<xref ref-type="disp-formula" rid="eqn-528">528</xref>)-(<xref ref-type="disp-formula" rid="eqn-530">530</xref>). The 2nd row in Eq. (<xref ref-type="disp-formula" rid="eqn-531">531</xref>) looks complex, but before getting into its derivation, we note that the same line of derivation as for the 1st row can be followed to arrive at different, and simpler, expressions for the 2nd-row matrix elements <inline-formula id="ieqn-2995"><mml:math id="mml-ieqn-2995"><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-2996"><mml:math id="mml-ieqn-2996"><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which are similar to those in the 1st row, and which were already derived in Eq. (<xref ref-type="disp-formula" rid="eqn-530">530</xref>).</p>
<p>It can be easily verified that</p>
<p><disp-formula id="eqn-532"><label>(532)</label><mml:math id="mml-eqn-532" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><
mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
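<p>The verification in Eq. (<xref ref-type="disp-formula" rid="eqn-532">532</xref>) can also be carried out numerically. The following minimal Python/NumPy sketch, offered only as an illustration, assembles the right-hand side of Eq. (<xref ref-type="disp-formula" rid="eqn-531">531</xref>) for a random non-symmetric matrix and checks both that its product with the original matrix is the identity and that it agrees with a direct numerical inverse. The matrix size, the diagonal shift used to keep the blocks comfortably invertible in this example, the random seed, and the variable names are assumptions of this sketch.</p>
<code language="python">
```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2  # hypothetical block sizes (assumed for illustration)

# Random non-symmetric matrix; the diagonal shift is an assumption of this
# sketch, used only to keep the matrix and its blocks well conditioned.
B = rng.standard_normal((n + m, n + m)) + (n + m) * np.eye(n + m)
P, Q = B[:n, :n], B[:n, n:]
R, S = B[n:, :n], B[n:, n:]

# M as defined in Eq. (531): inverse of P - Q S^{-1} R.
S_inv = np.linalg.inv(S)
M = np.linalg.inv(P - Q @ S_inv @ R)

# Right-hand side of Eq. (531), assembled block by block.
B_inv = np.block([
    [M, -M @ Q @ S_inv],
    [-S_inv @ R @ M, S_inv + S_inv @ R @ M @ Q @ S_inv],
])

# Eq. (532): the assembled matrix times the original matrix is the identity.
identity_ok = np.allclose(B_inv @ B, np.eye(n + m))
# Cross-check against a direct numerical inverse.
matches_direct = np.allclose(B_inv, np.linalg.inv(B))

print(identity_ok, matches_direct)
```
</code>
<p>Such a check does not replace the derivation, but it is a quick way to catch a misplaced sign or factor when transcribing the four blocks of Eq. (<xref ref-type="disp-formula" rid="eqn-531">531</xref>).</p>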
<p>To derive the 2nd row of Eq. (<xref ref-type="disp-formula" rid="eqn-531">531</xref>), premultiply the 1st row (which had been derived as mentioned above) of Eq. (<xref ref-type="disp-formula" rid="eqn-532">532</xref>) by <inline-formula id="ieqn-2997"><mml:math id="mml-ieqn-2997"><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to have</p>
<p><disp-formula id="eqn-533"><label>(533)</label><mml:math id="mml-eqn-533" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo 
stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>To make the right-hand side become <inline-formula id="ieqn-2998"><mml:math id="mml-ieqn-2998"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mi>I</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, add to both sides of Eq. (<xref ref-type="disp-formula" rid="eqn-533">533</xref>) the matrix <inline-formula id="ieqn-2999"><mml:math id="mml-ieqn-2999"><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mi>R</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mi>I</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> to obtain Eq. (<xref ref-type="disp-formula" rid="eqn-532">532</xref>)&#x2019;s 2nd row, whose complex expressions did not contribute to the derivation of the conditional Gaussian posterior mean Eq. (<xref ref-type="disp-formula" rid="eqn-378">378</xref>) and covariance Eq. (<xref ref-type="disp-formula" rid="eqn-379">379</xref>).</p>
<p>Yet another way to derive Eqs. (<xref ref-type="disp-formula" rid="eqn-378">378</xref>)-(<xref ref-type="disp-formula" rid="eqn-379">379</xref>) is to use the more complex proof in [<xref ref-type="bibr" rid="ref-248">248</xref>], p. 429, which was referred to in [<xref ref-type="bibr" rid="ref-234">234</xref>], p. 200 (see also Footnote <xref ref-type="fn" rid="fn242">242</xref>).&#x2003;&#x2003;&#x2003;&#x2003;&#x25A0;</p></statement>
<p>In summary, the above derivation is simpler and more direct than in [<xref ref-type="bibr" rid="ref-130">130</xref>], p. 87, and in [<xref ref-type="bibr" rid="ref-248">248</xref>], p. 429.</p>
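<p>As a numerical sanity check of Eq. (<xref ref-type="disp-formula" rid="eqn-532">532</xref>), the following minimal Python sketch multiplies the block expression on the left-hand side by the partitioned matrix and confirms that the product is the identity. The 2&#x00D7;2 blocks P, Q, R, S are hypothetical values, chosen only so that S and P &#x2212; Q S<sup>&#x2212;1</sup> R are invertible. The matrix M is assumed to be (P &#x2212; Q S<sup>&#x2212;1</sup> R)<sup>&#x2212;1</sup>, which is what the 1st row of Eq. (<xref ref-type="disp-formula" rid="eqn-532">532</xref>) implies. The helper functions are illustrative and not part of the original derivation. Exact rational arithmetic is used so that the comparison needs no tolerance.</p>
<preformat>
```python
from fractions import Fraction

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Entrywise sum of two matrices of equal shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def neg(A):
    """Entrywise negation of a matrix."""
    return [[-a for a in row] for row in A]

def inv2(A):
    """Exact inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def assemble(A11, A12, A21, A22):
    """Stack four 2x2 blocks into one 4x4 matrix."""
    top = [r1 + r2 for r1, r2 in zip(A11, A12)]
    bottom = [r1 + r2 for r1, r2 in zip(A21, A22)]
    return top + bottom

def frac(rows):
    """Convert integer entries to exact fractions."""
    return [[Fraction(x) for x in row] for row in rows]

# Hypothetical blocks (assumption): any values with invertible S and
# invertible P - Q S^-1 R would do.
P = frac([[4, 1], [2, 3]])
Q = frac([[1, 0], [2, 1]])
R = frac([[0, 1], [1, 1]])
S = frac([[2, 1], [1, 1]])

S_inv = inv2(S)
# Assumption inferred from the 1st row of Eq. (532): M (P - Q S^-1 R) = I.
M = inv2(add(P, neg(matmul(matmul(Q, S_inv), R))))

# The four blocks of the first matrix in Eq. (532).
B11 = M
B12 = neg(matmul(matmul(M, Q), S_inv))
B21 = neg(matmul(matmul(S_inv, R), M))
B22 = add(S_inv, matmul(matmul(matmul(matmul(S_inv, R), M), Q), S_inv))

product = matmul(assemble(B11, B12, B21, B22), assemble(P, Q, R, S))
identity = [[Fraction(int(i == j)) for j in range(4)] for i in range(4)]
print(product == identity)  # prints True
```
</preformat>
<p>Replacing the blocks with other values for which S and P &#x2212; Q S<sup>&#x2212;1</sup> R remain invertible should leave the final comparison true, since Eq. (<xref ref-type="disp-formula" rid="eqn-532">532</xref>) is an algebraic identity.</p>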
<fig id="fig-153">
<label>Figure 153</label>
<caption><title><italic>The first two waves of AI</italic>, according to [<xref ref-type="bibr" rid="ref-78">78</xref>], p.13, showing the &#x201C;cybernetics&#x201D; wave (blue line), which started in the 1940s, peaked before 1970, then gradually declined toward 2006 and beyond. The results were based on a search for the frequency of words in Google Books. It was mentioned, incorrectly, that the work of Rosenblatt (1957-1962) [<xref ref-type="bibr" rid="ref-1">1</xref>]-[<xref ref-type="bibr" rid="ref-2">2</xref>] was limited to one neuron; see Figure <xref ref-type="fig" rid="fig-42">42</xref> and Figure <xref ref-type="fig" rid="fig-133">133</xref>. (Figure reproduced with permission of the authors.)</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-153.tif"/>
</fig>
</sec>
<sec id="s18">
<title>4 The ups and downs of AI, cybernetics</title>
<p>The authors of [<xref ref-type="bibr" rid="ref-78">78</xref>], p.13, divided the wax-and-wane fate of AI into three waves. The first wave, called &#x201C;cybernetics&#x201D;, started in the 1940s, peaked before 1970, then began a gradual descent toward 1986, when the second wave picked up with the publication of [<xref ref-type="bibr" rid="ref-22">22</xref>] on an application of backpropagation to psychology; see Section <xref ref-type="sec" rid="s13_4_1">13.4.1</xref> on a history of backpropagation. Goodfellow (the first author of [<xref ref-type="bibr" rid="ref-78">78</xref>]) worked at Google at the time, and would have had access to the scanned books in the Google Books collection to do the search. For a concise historical account of &#x201C;cybernetics&#x201D;, see [<xref ref-type="bibr" rid="ref-425">425</xref>].</p>
<p>We had to rely on Web of Science to do the &#x201C;topic&#x201D; search for the keyword &#x201C;cyberneti*&#x201D;, i.e., using the query <monospace>ts=(cyberneti*)</monospace>, with &#x201C;*&#x201D; being the search wildcard, which stands for any sequence of characters that follows. Figure <xref ref-type="fig" rid="fig-154">154</xref> is the result,<xref ref-type="fn" rid="fn341"><sup>341</sup></xref><fn id="fn341"><label>341</label><p>The total number of papers on the topic &#x201C;cyberneti*&#x201D; was 7,962 on 2020.04.15&#x2013;as shown in Figure <xref ref-type="fig" rid="fig-154">154</xref>, obtained upon clicking on the &#x201C;Citation Report&#x201D; button in the Web of Science&#x2013;and 8,991 on 2022.08.08. Since the distribution in Figure <xref ref-type="fig" rid="fig-154">154</xref> and the points made in the figure caption and in this section remain the same, there was no need to update the figure to its 2022.08.08 version.</p></fn> spanning an astoundingly diverse set of more than 100 categories,<xref ref-type="fn" rid="fn342"><sup>342</sup></xref><fn id="fn342"><label>342</label><p>The number of categories had increased to 244 in the Web of Science search on 2022.08.08, mentioned in Footnote <xref ref-type="fn" rid="fn341">341</xref>, with the number of papers in Computer Science Cybernetics at 2,952, representing 32% of the 8,991 papers in this topic.</p></fn> listed here in descending order of the number of papers (in parentheses): Computer Science Cybernetics (2,665 papers), Computer Science Artificial Intelligence (601), Engineering Electrical Electronic (459),..., Philosophy (229),..., Social Sciences Interdisciplinary (225),..., Business (132),..., Psychology Multidisciplinary (128),..., Psychiatry (90),..., Art (66),..., Business Finance (43),..., Music (31),..., Religion (27),..., Cell Biology (21),..., Law (21),...</p>
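<p>The effect of the right-truncated keyword &#x201C;cyberneti*&#x201D; can be illustrated with a minimal Python sketch. It uses the shell-style wildcard matching of the Python standard library as a stand-in; it does not reproduce the Web of Science search engine, whose topic search covers several record fields and follows its own matching rules. The function name and the sample terms are hypothetical.</p>
<preformat>
```python
from fnmatch import fnmatchcase

# Right-truncated stem taken from the query ts=(cyberneti*).
PATTERN = "cyberneti*"

def matches_stem(word, pattern=PATTERN):
    """Return True if `word` matches the wildcard pattern, ignoring case."""
    return fnmatchcase(word.lower(), pattern)

# Hypothetical sample terms (assumption), not taken from the search results.
samples = ["cybernetics", "Cybernetic", "cybernetician", "cyberspace", "kybernetes"]
for term in samples:
    print(term, matches_stem(term))
```
</preformat>
<p>In this sketch, the first three sample terms match and the last two do not, which shows why the truncated stem catches the variants &#x201C;cybernetic&#x201D; and &#x201C;cybernetics&#x201D; in a single query.</p>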
<fig id="fig-154">
<label>Figure 154</label>
<caption><title><italic>Cybernetics papers</italic>, (Appendix <xref ref-type="sec" rid="s18">4</xref>). Web of Science search on 2020.04.15, spanning more than 100 Web of Science categories. The first paper was [<xref ref-type="bibr" rid="ref-426">426</xref>]. There was no clear wave that crested before 1970; instead, the number of papers in cybernetics continued to increase over the years.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-154.tif"/>
</fig>
<p>The first paper in 1949 [<xref ref-type="bibr" rid="ref-426">426</xref>] was categorized as Mathematics. More recent papers fall in categories such as Biological Science, e.g., [<xref ref-type="bibr" rid="ref-427">427</xref>], Building Construction, e.g., [<xref ref-type="bibr" rid="ref-428">428</xref>], and Accounting, e.g., [<xref ref-type="bibr" rid="ref-429">429</xref>].</p>
<p>It is interesting to note that McCulloch, who co-authored the well-known paper [<xref ref-type="bibr" rid="ref-430">430</xref>], was part of the original cybernetics movement that started in the 1940s, as noted in [<xref ref-type="bibr" rid="ref-431">431</xref>]:</p>
<disp-quote><p>&#x201C;Warren McCulloch, the &#x201C;chronic chairman&#x201D; and founder of the cybernetics conferences.<xref ref-type="fn" rid="fn343"><sup>343</sup></xref><fn id="fn343"><label>343</label><p>These cybernetics conferences were called the Macy conferences, held during a short period from 1946 to 1953, and involved researchers from diverse fields: not just mathematics, physics, engineering, but also anthropology and physiology, [<xref ref-type="bibr" rid="ref-431">431</xref>], pp.2-3.</p></fn> An eccentric physiologist, McCulloch had coauthored a foundational article of cybernetics on the brain&#x2019;s neural network.&#x201D;</p>
</disp-quote><p>But McCulloch &amp; Pitts&#x2019;s 1943 paper [<xref ref-type="bibr" rid="ref-430">430</xref>]&#x2013;often cited in artificial-neural-network papers (e.g., [<xref ref-type="bibr" rid="ref-23">23</xref>], [<xref ref-type="bibr" rid="ref-12">12</xref>]) and books (e.g., [<xref ref-type="bibr" rid="ref-78">78</xref>]), and dated six years before [<xref ref-type="bibr" rid="ref-426">426</xref>]&#x2013;was placed in the Web of Science category &#x201C;Biology; Mathematical &amp; Computational Biology,&#x201D; and did not show up in the search with keyword &#x201C;cyberneti*&#x201D; shown in Figure <xref ref-type="fig" rid="fig-154">154</xref>. One reason is that [<xref ref-type="bibr" rid="ref-430">430</xref>] did not contain the word &#x201C;cybernetics,&#x201D; which was not coined until 1948 with the famous book by Wiener, and which was part of the title of [<xref ref-type="bibr" rid="ref-426">426</xref>]. Cybernetics was a &#x201C;new science&#x201D; with a &#x201C;mysterious name and universal aspirations&#x201D; [<xref ref-type="bibr" rid="ref-431">431</xref>], p.5.</p>

<disp-quote><p>&#x201C;What exactly is (or was) cybernetics? This has been a perennial ongoing topic of debate within the American Society for Cybernetics throughout its 50-year history.... the word has a much older history reaching back to Plato, Amp&#x00E8;re (&#x201C;Cybern&#x00E9;tique = the art of growing&#x201D;), and others. &#x201C;Cybernetics&#x201D; comes from the Greek word for governance, <italic>kybernetike</italic>, and the related word, <italic>kybernetes</italic>, steersman or captain&#x201D; [<xref ref-type="bibr" rid="ref-432">432</xref>].</p>
</disp-quote>
<p>Steering a ship is controlling its direction. [<xref ref-type="bibr" rid="ref-433">433</xref>] defined cybernetics as</p>
<fig id="fig-155">
<label>Figure 155</label>
<caption><title><italic>Cybernetics papers</italic>, (Appendix <xref ref-type="sec" rid="s18">4</xref>). Web of Science search on 2020.04.17, ALL Computer-Science categories (3,555 papers): Cybernetics (2,666), Artificial Intelligence (602), Information Systems (432), Theory Methods (300), Interdisciplinary Applications (293), Software Engineering (163). The wave crest was in 2007, with a tiny bump in 1980.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-155.tif"/></fig>
<disp-quote><p>&#x201C;... (feedback) control and communication theory pertinent to the description, analysis, or construction of systems that involve (1) mechanisms (receptors) for the reception of messages or stimuli, (2) means (circuits) for communication of these to (3) a central control unit that responds by feeding back through the system (4) instructions that (will or tend to) produce specific actions on the part of (5) particular elements (effectors) of the system.... The central concept in cybernetics is a feedback mechanism that, in response to information (stimuli, messages) received through the system, feeds back to the system instructions that modify or otherwise alter the performance of the system.&#x201D;</p>
</disp-quote><p>Even though [<xref ref-type="bibr" rid="ref-432">432</xref>] did not use the word &#x201C;control&#x201D;, the definition is similar:</p>
<disp-quote><p>&#x201C;The core concepts involved natural and artificial systems organized to attain internal stability (homeostasis), to adjust internal structure and behavior in light of experience (adaptive, self-organizing systems), and to pursue autonomous goal-directed (purposeful, purposive) behavior.&#x201D; [<xref ref-type="bibr" rid="ref-432">432</xref>]</p>
</disp-quote><p>and is succinctly summarized by [<xref ref-type="bibr" rid="ref-434">434</xref>]:</p>
<disp-quote><p>&#x201C;If &#x201C;cybernetics&#x201D; means &#x201C;control and communication,&#x201D; what does it not mean? It would be difficult to think of any process in which nothing is either controlled or communicated.&#x201D;</p>
</disp-quote><p>which is the reason why cybernetics is found in a large number of different fields. [<xref ref-type="bibr" rid="ref-431">431</xref>], p.4, offered a similar, more detailed explanation of cybernetics as encompassing all fields of knowledge:</p>
<disp-quote><p>&#x201C;Wiener and Shannon defined the amount of information transmitted in communications systems with a formula mathematically equivalent to entropy (a measure of the degradation of energy). Defining information in terms of one of the pillars of physics convinced many researchers that information theory could bridge the physical, biological, and social sciences. The allure of cybernetics rested on its promise to model mathematically the purposeful behavior of all organisms, as well as inanimate systems. Because cybernetics included information theory in its purview, its proponents thought it was more universal than Shannon&#x2019;s theory, that it applied to all fields of knowledge.&#x201D;</p>
</disp-quote>
<fig id="fig-156">
<label>Figure 156</label>
<caption><title><italic>Cybernetics papers</italic>, (Appendix <xref ref-type="sec" rid="s18">4</xref>). Web of Science search on 2020.04.15 (two days before Figure <xref ref-type="fig" rid="fig-155">155</xref>), category Computer Science Cybernetics (2,665 papers). The wave crest was in 2007, with a tiny bump in 1980. </title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-156.tif"/>
</fig>
<fig id="fig-157">
<label>Figure 157</label>
<caption><title><italic>Cybernetics papers</italic>, (Appendix <xref ref-type="sec" rid="s18">4</xref>). Web of Science search on 2020.04.15 (two days before Figure <xref ref-type="fig" rid="fig-155">155</xref>), category Computer Science Artificial Intelligence (601 papers). Similar to Figure <xref ref-type="fig" rid="fig-156">156</xref>, the wave crest was in 2007, but with no tiny bump in 1980, since the first paper was in 1982.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-157.tif"/>
</fig>
<p>In 1969, the then president of the International Association of Cybernetics asked &#x201C;But after all what is cybernetics? Or rather what is it not, for paradoxically the more people talk about cybernetics the less they seem to agree on a definition,&#x201D; then identified several meanings: A mathematical control theory, automation, computerization, communication theory, study of human-machine analogies, philosophy explaining the mysteries of life! [<xref ref-type="bibr" rid="ref-431">431</xref>], p.5.</p>
<fig id="fig-158">
<label>Figure 158</label>
<caption><title><italic>Artificial Intelligence</italic> (AI), <italic>Machine Learning</italic> (ML), and <italic>Deep Learning</italic> (DL). <italic>Cybernetics</italic> is broad and encompasses many fields, including AI. See also Figure <xref ref-type="fig" rid="fig-6">6</xref>.</title></caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_28130-fig-158.tif"/>
</fig>
<p>So was there a first wave in AI called &#x201C;cybernetics&#x201D;? Back in October 2018, we conveyed our search result at that time&#x2013;which was similar to Figure <xref ref-type="fig" rid="fig-154">154</xref>, and clearly did not support the existence of the cybernetics wave shown in Figure <xref ref-type="fig" rid="fig-153">153</xref>&#x2013;to Y. Bengio of [<xref ref-type="bibr" rid="ref-78">78</xref>], who then replied:</p>
<disp-quote><p>&#x201C;Ian [Goodfellow] did those figures, but my take on your observations is that the later surge in &#x2019;new cybernetics&#x2019; does not have much more to do with artificial neural networks. I&#x2019;m not sure why the Google Books search did not catch that usage, though.&#x201D;</p>
</disp-quote><p>We then selected only the categories that had the words &#x201C;Computer Science&#x201D; in their names; there were only six such categories among more than 100 categories, as shown in Figure <xref ref-type="fig" rid="fig-155">155</xref>. A similar figure obtained in Oct 2018 was also shared with Bengio, who had no further comment. The wave crest in Figure <xref ref-type="fig" rid="fig-155">155</xref> occurred in 2007, with a tiny bump in 1980, but not before 1970 as in Figure <xref ref-type="fig" rid="fig-153">153</xref>.</p>
<p>Figure <xref ref-type="fig" rid="fig-156">156</xref> is the histogram for the largest single category, Computer Science Cybernetics, with 2,665 papers. Similar to Figure <xref ref-type="fig" rid="fig-155">155</xref>, the wave crest in this figure also occurred in 2007, with a tiny bump in 1980.</p>
<p>Figure <xref ref-type="fig" rid="fig-157">157</xref> is the histogram for the category Computer Science Artificial Intelligence, with 601 papers. Again, similar to Figure <xref ref-type="fig" rid="fig-155">155</xref> and Figure <xref ref-type="fig" rid="fig-156">156</xref>, the wave crest occurred in 2007, but with no bump in 1980. The first document, a 5-year plan report of Latvia, appeared in 1982. There is a large &#x201C;impulse&#x201D; in the number of papers in 2007, and a smaller &#x201C;impulse&#x201D; in 2014, but no smooth bump. No papers appeared for 9 years between 1982 and 1992, when a single paper on cooperative agents appeared in the series &#x201C;Lecture Notes in Artificial Intelligence&#x201D;.</p>
<p>Cybernetics, including the original cybernetics moment, as described in [<xref ref-type="bibr" rid="ref-431">431</xref>], encompassed many fields and involved many researchers not working on neural nets, such as Wiener, John von Neumann, Margaret Mead (anthropologist), etc., whereas the physiologist McCulloch co-authored the first &#x201C;foundational article of cybernetics on the brain&#x2019;s neural network&#x201D;. So it is not easy to attribute even the original cybernetics moment to research on neural nets alone. Moreover, many topics of interest to researchers at the time involved natural systems (including [<xref ref-type="bibr" rid="ref-430">430</xref>]), and thus natural intelligence, instead of artificial intelligence.</p></sec></app>
</app-group>
</back>
</article>