<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="style/jpub3-html-trans.xsl"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMES</journal-id>
<journal-id journal-id-type="nlm-ta">CMES</journal-id>
<journal-id journal-id-type="publisher-id">CMES</journal-id>
<journal-title-group>
<journal-title>Computer Modeling in Engineering &#x0026; Sciences</journal-title>
</journal-title-group>
<issn pub-type="epub">1526-1506</issn>
<issn pub-type="ppub">1526-1492</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">20128</article-id>
<article-id pub-id-type="doi">10.32604/cmes.2022.020128</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Optimizing Big Data Retrieval and Job Scheduling Using Deep Learning Approaches</article-title>
<alt-title alt-title-type="left-running-head">Optimizing Big Data Retrieval and Job Scheduling Using Deep Learning Approaches</alt-title>
<alt-title alt-title-type="right-running-head">Optimizing Big Data Retrieval and Job Scheduling Using Deep Learning Approaches</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Chang</surname><given-names>Bao Rong</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Tsai</surname><given-names>Hsiu-Fen</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><email>sftsai@kmu.edu.tw</email>
</contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Lin</surname><given-names>Yu-Chieh</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Department of Computer Science and Information Engineering, National University of Kaohsiung</institution>, <addr-line>Kaohsiung</addr-line>, <country>Taiwan</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Fragrance and Cosmetic Science, Kaohsiung Medical University</institution>, <addr-line>Kaohsiung</addr-line>, <country>Taiwan</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Hsiu-Fen Tsai. Email: <email>sftsai@kmu.edu.tw</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2022-08-26"><day>26</day>
<month>08</month>
<year>2022</year></pub-date>
<volume>134</volume>
<issue>2</issue>
<fpage>783</fpage>
<lpage>815</lpage>
<history>
<date date-type="received"><day>05</day><month>11</month><year>2021</year></date>
<date date-type="accepted"><day>28</day><month>3</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2022 Chang et al.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Chang et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMES_20128.pdf"></self-uri>
<abstract>
<p>Big data analytics in business intelligence often lacks effective data retrieval methods and job scheduling, which causes execution inefficiency and low system throughput. This paper aims to enhance data retrieval and job scheduling to speed up big data analytics and overcome these inefficiency and low-throughput problems. First, integrating a stacked sparse autoencoder with Elasticsearch indexing provides fast data searching and distributed indexing, which reduces the search scope of the database and dramatically speeds up data searching. Next, a deep neural network predicts the approximate execution time of each job so that jobs can be prioritized according to shortest job first, which reduces the average waiting time of job execution. As a result, the proposed data retrieval approach outperforms the previous method using a deep autoencoder and Solr indexing, improving the speed of data retrieval by up to 53&#x0025; and increasing system throughput by 53&#x0025;. In addition, the proposed job scheduling algorithm outperforms both the first-in-first-out and the memory-sensitive heterogeneous early finish time scheduling algorithms, shortening the average waiting time by up to 5&#x0025; and the average weighted turnaround time by 19&#x0025;.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Stacked sparse autoencoder</kwd>
<kwd>Elasticsearch</kwd>
<kwd>distributed indexing</kwd>
<kwd>data retrieval</kwd>
<kwd>deep neural network</kwd>
<kwd>job scheduling</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>In recent years, the rapid growth of the amount of data, coupled with the declining cost of storage equipment, the evolution of software technology, and the maturity of the cloud environment, has led to the rapid development of big data analytics [<xref ref-type="bibr" rid="ref-1">1</xref>]. When the amount of data is enormous and the data flows quickly, traditional methods can no longer handle data storage, computing, and analysis in time, and excessively large-scale access also causes severe I/O delay in a system. Faced with such explosive growth of data, implementing large-scale distributed computing [<xref ref-type="bibr" rid="ref-2">2</xref>] with large-scale data clustering and not-only-SQL (NoSQL) storage technology has become a popular solution in recent years. Apache Hadoop and Spark [<xref ref-type="bibr" rid="ref-3">3</xref>] are currently the most widely known big data analytics platforms in business intelligence with decentralized computing capabilities. Each of them is well suited to typical extract-transform-load (ETL) workloads [<xref ref-type="bibr" rid="ref-4">4</xref>] due to its scalability and relatively low cost. Hadoop uses first-in-first-out (FIFO) [<xref ref-type="bibr" rid="ref-5">5</xref>] scheduling by default, which runs jobs in the order in which they arrive. Although this kind of scheduling is relatively fair, it is likely to cause low system throughput. In addition, the response time of data retrieval in a traditional decentralized system is lengthy, causing inefficient job execution. Therefore, improving the efficiency of data retrieval and job scheduling becomes a crucial issue for big data analytics in business intelligence.</p>
<p>In terms of artificial intelligence development, deep learning [<xref ref-type="bibr" rid="ref-6">6</xref>] has developed rapidly in recent years. AlphaGo [<xref ref-type="bibr" rid="ref-7">7</xref>], developed by Google DeepMind in London, UK, defeated top professional Go players between 2015 and 2017, and AI became a hot research topic once again. The use of machine learning [<xref ref-type="bibr" rid="ref-8">8</xref>] or deep learning in various aspects of research and related applications continues to flourish. IBM developed a novel deep learning technology that mimics the working principle of the human brain, which can significantly reduce the time needed to process a large amount of data. Deep learning is a branch of artificial intelligence, and today&#x2019;s technology giants Facebook, Amazon, and Google focus on its related development for many innovations. The explosive growth of data will continue in this era, so processing and analyzing a large amount of data effectively has become an important topic.</p>
<p>Nowadays, data retrieval and job scheduling research mainly focuses on using the open-source big data platforms Hadoop and Spark in business intelligence systems to improve efficiency. This study considers encoding and indexing technologies to realize fast data retrieval and thus increase the system throughput in big data analytics. It also develops high-performance job scheduling to reduce the time that jobs wait in a queue for execution. The first objective of this paper is to develop an advanced deep learning model together with a high-performance indexing engine that significantly beats the previous method, which integrates a deep autoencoder [<xref ref-type="bibr" rid="ref-9">9</xref>] (DAE) and Solr indexing [<xref ref-type="bibr" rid="ref-10">10</xref>] (abbreviated DAE-SOLR), in the speed of big data retrieval. Based on a deep learning model, this paper integrates a stacked sparse autoencoder (SSAE) [<xref ref-type="bibr" rid="ref-11">11</xref>] and Elasticsearch indexing [<xref ref-type="bibr" rid="ref-12">12</xref>], abbreviated SSAE-ES, to create a fast approach to data searching and distributed indexing, which reduces the search scope of the database and dramatically speeds up data searching. The proposed SSAE-ES can outperform the previous method DAE-SOLR with higher data retrieval efficiency and system throughput for big data analytics in business intelligence. In addition, this paper exploits deep neural networks (DNN) [<xref ref-type="bibr" rid="ref-13">13</xref>] to predict the approximate execution time of jobs and prioritizes job scheduling based on the shortest job first (SJF) [<xref ref-type="bibr" rid="ref-14">14</xref>], which can reduce the average waiting time of job execution and the average weighted turnaround time [<xref ref-type="bibr" rid="ref-15">15</xref>].</p>
</sec>
<sec id="s2"><label>2</label><title>Related Work</title>
<sec id="s2_1"><label>2.1</label><title>Literature Review</title>
<p>The demand for real-time data processing and analysis keeps growing for big data analytics in business intelligence. In pursuit of better data analysis performance, job scheduling plays a vital role in improving big data analytics. For example, Yeh et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] proposed that the user dynamically assign a priority to each job to speed up its execution. However, this approach cannot start jobs automatically and requires permission from the user every time. Thangaselvi et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] imported the self-adaptive MapReduce (SAMR) algorithm into Hadoop, which adjusts its parameters by recalling the historical information saved on each node, thereby dynamically finding slow jobs. However, SAMR relies on a model established by K-means [<xref ref-type="bibr" rid="ref-18">18</xref>]. K-means usually makes specific assumptions about the data distribution, while a DNN with hierarchical feature learning makes no explicit assumptions about the data. Therefore, a DNN can model a more complex data distribution than K-means and obtain better prediction results.</p>
<p>Many studies also apply deep learning to distributed computing nodes. For example, Marquez et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] combined the concept of deep cascade learning with Spark&#x2019;s distributed computing, using multi-layer perceptrons to build a model that can perform large-scale data analysis in a short time [<xref ref-type="bibr" rid="ref-20">20</xref>]. However, jobs still wait longer for execution because this model relies on FIFO and lacks job scheduling optimization. Likewise, Lee et al. [<xref ref-type="bibr" rid="ref-21">21</xref>] applied data clustering with a deep autoencoder and Solr indexing to significantly improve query performance. However, that system also lacks suitable job scheduling. Besides, Chang et al. [<xref ref-type="bibr" rid="ref-22">22</xref>] proposed the memory-sensitive heterogeneous early finish time (MSHEFT) algorithm for improving job scheduling. Unfortunately, when different jobs have the same data size, it schedules them just like FIFO and brings no improvement.</p>
<p>Nature-inspired optimization can offer some searching advantages by using meta-heuristic approaches rather than deep learning models. Abualigah et al. proposed a novel population-based optimization method called Aquila Optimizer (AO) [<xref ref-type="bibr" rid="ref-23">23</xref>], which is inspired by the Aquila&#x2019;s natural behaviors while catching prey. Abualigah et al. [<xref ref-type="bibr" rid="ref-24">24</xref>] proposed a novel nature-inspired meta-heuristic optimizer, called Reptile Search Algorithm (RSA), motivated by the hunting behavior of crocodiles. Regarding the arithmetic optimization algorithm used for searching the target data, Abualigah et al. [<xref ref-type="bibr" rid="ref-25">25</xref>] presented a comprehensive survey of the Internet of Drones (IoD) and its applications, deployments, and integration. Integration of IoD includes privacy protection, security authentication, neural networks, blockchain, and optimization-based methods. However, time consumption becomes a crucial cost when meta-heuristic approaches or the arithmetic optimization algorithm are used for searching in big data retrieval.</p>
</sec>
<sec id="s2_2"><label>2.2</label><title>Encoder for Data Clustering</title>
<p>A deep autoencoder (DAE) is an unsupervised neural network model. Its architecture includes two fully connected feedforward networks, called the encoder and the decoder, which perform compression and decompression, respectively. After training the DAE, the user keeps the encoder and feeds the data into it. The encoder outputs a point located in three-dimensional space, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. The X, Y, and Z axes divide the three-dimensional space into eight quadrants (octants), and the quadrant number, from 1 to 8, is inserted into the last column of the table, as shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>Data set mapping to 3D coordinates</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-1.png"/></fig><fig id="fig-2"><label>Figure 2</label><caption><title>Encoder for data clustering</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-2.png"/></fig>
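<p>As a minimal sketch of the quadrant numbering described above, the following Python function maps a three-dimensional encoder output to a label from 1 to 8 according to the signs of its coordinates. The function name <monospace>octant_label</monospace>, the numbering order, and the treatment of zero as non-negative are assumptions made for illustration; the text does not specify them.</p>
<preformat preformat-type="code"><![CDATA[
```python
def octant_label(x, y, z):
    """Map a 3-D point to a cluster label in 1..8 from the signs of x, y, z.

    Assumption: a coordinate of exactly 0 counts as non-negative, and the
    label order (x contributes 1, y contributes 2, z contributes 4) is
    illustrative only.
    """
    index = 0
    if x < 0:
        index += 1
    if y < 0:
        index += 2
    if z < 0:
        index += 4
    return index + 1


# Hypothetical encoder outputs, one point per data row.
points = [(0.4, 0.2, 0.9), (-0.1, 0.3, 0.5), (-0.7, -0.2, -0.6)]
labels = [octant_label(*p) for p in points]
print(labels)
```
]]></preformat>
<p>Each label would then be written into the last column of the corresponding row, so that rows sharing a label can be grouped into the same file before indexing.</p>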
</sec>
<sec id="s2_3"><label>2.3</label><title>Stacked Sparse Autoencoder (SSAE)</title>
<p>Training a plain autoencoder (AE) tends to activate the neurons of a hidden layer too frequently, so the AE can easily overfit because it has too many degrees of freedom. A sparse constraint is therefore added to the encoder part to reduce the number of activations of each neuron for every input signal. A specific input signal can activate only some neurons and will probably leave the others inactive. In other words, no input signal of the sparse autoencoder can activate all of the neurons during the training phase, as shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>Sparse autoencoder model</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-3.png"/></fig>
<p>An autoencoder with sparse constraints is called a sparse autoencoder (SAE) [<xref ref-type="bibr" rid="ref-26">26</xref>]. SAE defines its loss function in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>, where L represents the loss function without the sparse constraint, &#x03B2; weights the sparsity penalty, <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>K</mml:mi><mml:mi>L</mml:mi></mml:math></inline-formula> stands for the <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>K</mml:mi><mml:mi>L</mml:mi></mml:math></inline-formula> divergence [<xref ref-type="bibr" rid="ref-27">27</xref>], <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>&#x03C1;</mml:mi></mml:math></inline-formula> denotes the expected activation degree of neurons in the network (when the activation function is Sigmoid, SAE sets its value to 0.05, which means that most neurons are not activated), and <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03C1;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the average activation degree of the j<sup>th</sup> neuron. <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> defines the <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>K</mml:mi><mml:mi>L</mml:mi></mml:math></inline-formula> divergence, and the <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>K</mml:mi><mml:mi>L</mml:mi></mml:math></inline-formula> divergence measures the difference between the average activation output of hidden layer nodes and the sparsity <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>&#x03C1;</mml:mi></mml:math></inline-formula>. <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref> defines <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03C1;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> as the average activation degree over the training sample set, in which <italic>m</italic> represents the number of training samples, and <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> evaluated at <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is the response output of the j<sup>th</sup> node in the hidden layer to the i<sup>th</sup> sample.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:munder><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mi>j</mml:mi></mml:munder><mml:mo>&#x2061;</mml:mo><mml:mi>K</mml:mi><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03C1;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>K</mml:mi><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03C1;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mfrac><mml:mi>&#x03C1;</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mover><mml:mi>&#x03C1;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C1;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mover><mml:mi>&#x03C1;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03C1;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>m</mml:mi></mml:mfrac><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
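<p>The sparsity terms above can be illustrated with a small numerical sketch in plain Python. The helper names, the toy activation values, and the default penalty weight <monospace>beta=1.0</monospace> are assumptions for illustration only; the activations are read as the response of hidden node j to sample i, averaged over the m samples.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def average_activation(activations):
    """Eq. (3): mean activation of each hidden node over the m samples.

    activations[i][j] is the response of hidden node j to sample i.
    """
    m = len(activations)
    n_hidden = len(activations[0])
    return [sum(row[j] for row in activations) / m for j in range(n_hidden)]


def kl_divergence(rho, rho_hat):
    """Eq. (2): KL divergence between the target rho and the average rho_hat."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def sparse_loss(base_loss, activations, rho=0.05, beta=1.0):
    """Eq. (1): base loss plus beta times the KL penalty summed over nodes."""
    rho_hats = average_activation(activations)
    penalty = sum(kl_divergence(rho, r) for r in rho_hats)
    return base_loss + beta * penalty


# Toy sigmoid activations: 4 samples, 2 hidden nodes (values are made up).
acts = [[0.05, 0.60], [0.04, 0.70], [0.06, 0.50], [0.05, 0.60]]
print(average_activation(acts))
print(sparse_loss(0.10, acts))
```
]]></preformat>
<p>In this toy case the first node averages close to the target 0.05 and contributes almost nothing to the penalty, while the second node averages about 0.6 and is penalized, which is the behavior the sparse constraint is meant to enforce.</p>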
<p>After training multiple SAEs, each tuned layer by layer, the user stacks them up to form a stacked sparse autoencoder (SSAE), as shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. The encoded output of the previous stage acts as the input of the next stage. After training the first SAE in Stage 1, the user obtains the first hidden layer with m neurons, denoted Hidden 1. Likewise, by copying Hidden 1 into the second SAE as its input in Stage 2, the user obtains the second hidden layer with n neurons, denoted Hidden 2. Finally, the user stacks the two SAEs to form the SSAE. Each hidden layer generates a new set of features from the hidden layer that precedes it. In the same way, the user can add more hidden layers through layer-by-layer training in further stages.</p>
<fig id="fig-4"><label>Figure 4</label><caption><title>Layer-wise pre-training of two SAEs and stacking them up</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-4.png"/></fig>
</sec>
<sec id="s2_4"><label>2.4</label><title>Data Retrieval by Indexing</title>
<p>According to the quadrant value in the last field, the data is divided into different files and then sent to Solr for data indexing. Searching takes longer as the index volume grows, so distributed indexing is used to shorten the search time in large-scale data retrieval. SolrCloud relieves the load of single-machine processing by letting multiple servers complete the indexing together. Its architecture is shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>.</p>
<fig id="fig-5"><label>Figure 5</label><caption><title>Solr architecture</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-5.png"/></fig>
<p>After creating an index (collection) of the data, Solr divides the index into multiple shards, and each shard has a leader for job distribution, as shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>. When a replica completes its job, it sends the result back to the leader, and the leader then sends the result back to SolrCloud for the final result. When data is uploaded, the user can upload the file to any replica. If that replica is not the leader, it forwards the request to the leader of the same shard, and the leader gives the file path to each replica in the same shard. If the file does not belong to that shard, the leader transfers it to the leader of its corresponding shard, which in turn gives the file path to each replica in its shard. After completing data clustering with the encoder, the user manually uploads the data to Solr for subsequent use by the nodes. The flowchart of data preprocessing illustrates how data clustering and indexing are prepared together, as shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>.</p>
<fig id="fig-6"><label>Figure 6</label><caption><title>SolrCloud architecture</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-6.png"/></fig><fig id="fig-7"><label>Figure 7</label><caption><title>Data preprocessing flow</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-7.png"/></fig>
</sec>
<sec id="s2_5"><label>2.5</label><title>SQL Interface Selection</title>
<p>Automatic SQL interface selection chooses an appropriate SQL interface [<xref ref-type="bibr" rid="ref-28">28</xref>], such as Hive, Impala, or SparkSQL, so that SQL commands run with the best efficiency. The selection depends on the remaining memory size of a cluster node. After trial and error, we found two critical points, labeled L1 and L2, where L1 is around 5 GB of memory and L2 is around 10 GB. In this study, each node is configured with 20 GB of memory, which defines three memory zones ranging from 0&#x223C;5 GB, 5 GB&#x223C;10 GB, and 10 GB&#x223C;20 GB, as shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>. When the remaining memory size is less than L1, the system automatically selects the Hive interface to perform SQL commands. Between L1 and L2, the system uses the Impala interface. Above L2, the system chooses the SparkSQL interface. We note that the Impala and SparkSQL interfaces need more memory to run a job with a large amount of data. Given sufficient memory, SparkSQL with its in-memory computing capability achieves the best efficiency in big data analytics. In contrast, the Hive interface can run any job with little memory but is time-consuming.</p>
<fig id="fig-8"><label>Figure 8</label><caption><title>SQL interface selection</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-8.png"/></fig>
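<p>The selection rule above can be sketched as a simple threshold function. The thresholds of 5 GB and 10 GB come from the text; the function name and the handling of the exact boundary values are assumptions for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
L1_GB = 5.0   # lower critical point from the text
L2_GB = 10.0  # upper critical point from the text


def select_sql_interface(free_memory_gb):
    """Pick a SQL interface from the remaining memory of a node (in GB).

    Assumption: a value exactly equal to L1 or L2 falls in the Impala zone.
    """
    if free_memory_gb < L1_GB:
        return "Hive"
    if free_memory_gb <= L2_GB:
        return "Impala"
    return "SparkSQL"


for free in (3.0, 7.5, 16.0):
    print(free, select_sql_interface(free))
```
]]></preformat>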
</sec>
<sec id="s2_6"><label>2.6</label><title>Job Scheduling Algorithm MSHEFT</title>
<p>Heterogeneous early finish time (HEFT) [<xref ref-type="bibr" rid="ref-29">29</xref>] is a heuristic for scheduling a set of dependent jobs onto a network of heterogeneous workers while taking communication time into account. HEFT is a list-scheduling algorithm that first establishes a priority list. It then follows the sorted priority list and assigns each job to the appropriate CPU so that the job completes as soon as possible. The HEFT algorithm was modified into the memory-sensitive heterogeneous early finish time (MSHEFT) algorithm [<xref ref-type="bibr" rid="ref-22">22</xref>], which first considers the job&#x2019;s priority, then the data size, and, finally, the remaining memory size to choose the appropriate SQL command interface automatically. The flow chart of job processing with the MSHEFT algorithm and automatic SQL interface selection is shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>.</p>
<fig id="fig-9"><label>Figure 9</label><caption><title>Job scheduling with MSHEFT algorithm and SQL interface selection</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-9.png"/></fig>
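<p>A rough sketch of the ordering step, not the published MSHEFT implementation, is given below. It assumes that a smaller priority number means a more urgent job and that smaller data sizes run first; both conventions, as well as the <monospace>Job</monospace> fields, are hypothetical. Because the sort is stable, jobs that tie on both keys keep their arrival order, which mirrors the FIFO-like behavior noted earlier for jobs with the same data size.</p>
<preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int        # assumption: smaller number = more urgent
    data_size_gb: float  # size of the data the job reads


def msheft_order(jobs):
    """Order jobs by priority, then by data size (smaller first).

    sorted() is stable, so jobs that tie on both keys keep their arrival
    order, i.e., they are served first-in-first-out.
    """
    return sorted(jobs, key=lambda job: (job.priority, job.data_size_gb))


# Jobs listed in arrival order (values are made up).
arrivals = [
    Job("A", priority=1, data_size_gb=8.0),
    Job("B", priority=0, data_size_gb=4.0),
    Job("C", priority=1, data_size_gb=2.0),
    Job("D", priority=1, data_size_gb=2.0),
]
order = [job.name for job in msheft_order(arrivals)]
print(order)
```
]]></preformat>
<p>After ordering, the remaining memory of the node would determine which SQL interface runs each job.</p>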
</sec>
<sec id="s2_7"><label>2.7</label><title>Two-Class Caching in Data Retrieval</title>
<p>Two-class caching is commonly used to maximize a distributed system&#x2019;s performance and optimize data retrieval and job scheduling. It consists of an in-memory cache and an in-disk cache, which avoid the excessive hardware resource consumption caused by repeated searches. The flow of two-class caching is shown in <xref ref-type="fig" rid="fig-10">Fig. 10</xref>.</p>
<fig id="fig-10"><label>Figure 10</label><caption><title>The flow of two-class caching in data retrieval</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-10.png"/></fig>
</sec>
<sec id="s2_8"><label>2.8</label><title>In-Memory Cache Mechanism</title>
<p>For the in-memory cache, Memcached [<xref ref-type="bibr" rid="ref-30">30</xref>], a distributed high-speed memory caching system, temporarily stores data. Memcached stores data in memory as key-value pairs. A large hash table is constructed in memory so that users can quickly look up a value by its key. Therefore, it is often used as a cache to speed up system access. However, Memcached has specific restrictions: by default, a key cannot exceed 250 characters, and a stored value cannot exceed 1 MB. Therefore, search results must first be divided into pieces before they are stored in the cache.</p>
<p>After the user enters the SQL command, the system first uses the MD5 algorithm [<xref ref-type="bibr" rid="ref-31">31</xref>] to convert the command into a 16-byte hash value and uses the hash value as the unique identification code (i.e., the key) for this search job, as shown in <xref ref-type="fig" rid="fig-11">Fig. 11</xref>. If the same SQL command is issued later, the MD5 algorithm produces the same hash value. The system can then obtain the data through this unique identification code, which realizes the cache mechanism.</p>
<fig id="fig-11"><label>Figure 11</label><caption><title>In-memory cache flow</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-11.png"/></fig>
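<p>The keying and splitting steps described above can be sketched as follows. A plain Python dictionary stands in for a Memcached client, and the chunk-key scheme (<monospace>&lt;md5&gt;:&lt;n&gt;</monospace>), the 1 KB headroom, and the sample query are assumptions for illustration; only the MD5 key and the 1 MB value limit come from the text.</p>
<preformat preformat-type="code"><![CDATA[
```python
import hashlib

MAX_VALUE_BYTES = 1024 * 1024  # default Memcached value limit (1 MB)


def cache_key(sql_command):
    """Return the 16-byte MD5 digest of a SQL command as 32 hex characters."""
    return hashlib.md5(sql_command.encode("utf-8")).hexdigest()


def split_result(result_bytes, chunk_size=MAX_VALUE_BYTES - 1024):
    """Split a search result into chunks that stay under the value limit.

    Assumption: 1 KB of headroom is left for per-item overhead.
    """
    return [result_bytes[i:i + chunk_size]
            for i in range(0, len(result_bytes), chunk_size)]


def store(cache, sql_command, result_bytes):
    """Store chunks under '<md5>:<n>' and the chunk count under '<md5>'."""
    key = cache_key(sql_command)
    chunks = split_result(result_bytes)
    cache[key] = len(chunks)
    for n, chunk in enumerate(chunks):
        cache[f"{key}:{n}"] = chunk
    return key


def load(cache, sql_command):
    """Reassemble a cached result, or return None on a cache miss."""
    key = cache_key(sql_command)
    if key not in cache:
        return None
    return b"".join(cache[f"{key}:{n}"] for n in range(cache[key]))


cache = {}  # stand-in for a Memcached client
query = "SELECT * FROM sales WHERE region = 'north'"  # hypothetical query
result = b"x" * (3 * 1024 * 1024)
store(cache, query, result)
print(len(cache_key(query)), cache[cache_key(query)])
```
]]></preformat>
<p>Issuing the same SQL command again yields the same key, so <monospace>load</monospace> returns the stored result without rerunning the query.</p>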
</sec>
<sec id="s2_9"><label>2.9</label><title>In-Disk Cache Mechanism</title>
<p>For the in-disk cache, this paper uses the HDFS distributed file system as the storage platform. Since HDFS does not have storage capacity limitations like the Memcached system, it is only necessary to save a copy of the search result and upload it to HDFS. The same unique identification code used in the in-memory cache names the file for identification, as shown in <xref ref-type="fig" rid="fig-12">Fig. 12</xref>. Since files in HDFS do not expire automatically, they would otherwise have to be deleted manually from time to time. Therefore, the system in this paper also provides an active clearance function. The user only needs to enter the &#x201C;purge x&#x201D; command in the CLI, and the system automatically deletes the cache data that has not been accessed within x days.</p>
<fig id="fig-12"><label>Figure 12</label><caption><title>Cached files saving in HDFS</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-12.png"/></fig>
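<p>The clearance rule behind the &#x201C;purge x&#x201D; command can be sketched as a filter on last-access times. The dictionary of timestamps below is a hypothetical stand-in for the cache file listing in HDFS, and the function name and sample keys are assumptions; an actual system would also delete the corresponding HDFS files.</p>
<preformat preformat-type="code"><![CDATA[
```python
import time

SECONDS_PER_DAY = 86400


def purge(last_access, days, now=None):
    """Remove and return the cache entries not accessed within `days` days.

    last_access maps a cache file name (the MD5 key) to the Unix timestamp
    of its most recent access.
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    expired = [name for name, stamp in last_access.items() if stamp < cutoff]
    for name in expired:
        del last_access[name]  # a real system would also delete the HDFS file
    return expired


now = 1_700_000_000  # fixed reference time so the example is repeatable
index = {
    "key_recent": now - 1 * SECONDS_PER_DAY,
    "key_stale": now - 10 * SECONDS_PER_DAY,
}
removed = purge(index, days=7, now=now)
print(removed)
```
]]></preformat>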
</sec>
</sec>
<sec id="s3"><label>3</label><title>Method</title>
<sec id="s3_1"><label>3.1</label><title>Improved Big Data Analytics</title>
<p>As shown in <xref ref-type="fig" rid="fig-13">Fig. 13</xref>, improving the performance of big data analytics in a business intelligence system concerns three aspects: data retrieval, job scheduling, and the choice among different SQL command interfaces. The first aims to speed up data searching in big data. Therefore, this study must reduce the scope of data searching in the database and then retrieve data as fast as possible. This study reduces the searching scope through data clustering with a stacked sparse autoencoder, obtains a fast query response through distributed data indexing with Elasticsearch, and then stores the tables in HDFS. The speed of data retrieval also depends on the SQL command interface, so we explore how to find the appropriate interface for the current state of a node. This study considers three SQL command interfaces: Hive, Impala, and SparkSQL [<xref ref-type="bibr" rid="ref-28">28</xref>].</p>
<fig id="fig-13"><label>Figure 13</label><caption><title>The optimization of big data analytics flow</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-13.png"/></fig>
<p>For job scheduling, a deep neural network predicts the job execution time to optimize job scheduling in big data analytics. Giving higher scheduling priority to the job with the shortest predicted execution time reduces the average waiting time in the queue. Finally, users only need to enter SQL commands through the command-line interface (CLI), and the system automatically detects the remaining memory size of a node and selects the appropriate SQL interface to carry out the jobs, which also speeds up job execution; this is called automatic interface selection. In addition, the result of a SQL query can be saved into an in-memory or in-disk cache. Therefore, if the same data is requested repeatedly within a short time, it can be read directly from the cache without launching the SQL command interfaces, which speeds up data retrieval dramatically.</p>
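<p>The cache-first retrieval described above can be sketched as follows. The sketch rests on several assumptions that are not taken from this study: plain dictionaries stand in for Memcached and HDFS, the identification code is taken to be a SHA-256 hash of the whitespace-normalized SQL text, a fresh result is written to both caches, and the names <monospace>cache_key</monospace>, <monospace>retrieve</monospace>, and <monospace>run_query</monospace> are hypothetical.</p>
<code language="python"><![CDATA[
```python
import hashlib


def cache_key(sql_cmd):
    # Assumption: the key is a hash of the SQL text with whitespace collapsed.
    # (This simplification also collapses whitespace inside string literals.)
    normalised = " ".join(sql_cmd.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def retrieve(sql_cmd, memory_cache, disk_cache, run_query):
    """Look in the in-memory cache, then the in-disk cache, then run the query."""
    key = cache_key(sql_cmd)
    if key in memory_cache:
        return memory_cache[key], "in-memory"
    if key in disk_cache:
        return disk_cache[key], "in-disk"
    result = run_query(sql_cmd)   # launches a SQL command interface
    memory_cache[key] = result    # assumption: store in both caches
    disk_cache[key] = result
    return result, "sql-interface"
```
]]></code>
<p>Only the first request for a given query reaches a SQL command interface; later identical requests are answered from one of the two caches.</p>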
</sec>
<sec id="s3_2"><label>3.2</label><title>Data Clustering by Mapping</title>
<p>The Stacked Sparse Autoencoder (SSAE) model can be applied to clustering a data set. The dimension of the SSAE input and output layers equals the number of columns in the initial data, denoted m. SSAE constructs l layers of neurons as an encoder and the mirrored layers as a decoder, trained with unsupervised learning. The output of each layer is connected to the input of the following layer to reduce the data dimension, finally yielding an n-dimensional vector in the middle layer; the decoder then runs in reverse order to restore the initial input data columns. An example of an SSAE is shown in <xref ref-type="fig" rid="fig-14">Fig. 14</xref>, where the loss function is MSE, the activation function is Sigmoid, and the optimizer is Adam.</p>
<fig id="fig-14"><label>Figure 14</label><caption><title>Data clustering using stacked sparse autoencoder</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-14.png"/></fig>
<p><xref ref-type="fig" rid="fig-15">Fig. 15</xref> visualizes the results of training AE and SSAE for 50 epochs. Since AE has no sparsity restriction, some data of different classes become entangled in the same area, resulting in an uneven data distribution and worse data clustering. In <xref ref-type="fig" rid="fig-15">Fig. 15a</xref>, two classes may be mapped to the same area and become entangled with no apparent separation between them, such as the gray class tangled with the purple class, and the red class tangled with the brown class. The SSAE with sparsity constraints effectively mitigates this problem, bringing samples of the same class closer together and making the separation between different classes more pronounced, as shown in <xref ref-type="fig" rid="fig-15">Fig. 15b</xref>.</p>
<fig id="fig-15"><label>Figure 15</label><caption><title>Data visualization results between AE and SSAE training</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-15.png"/></fig>
<p>Here, we separate the data into eight classes for a specific data set. The output of the last hidden layer of the encoder of a trained SSAE model maps each row of data to a point in 3-dimensional space. According to the X, Y, and Z axes of this space, the SSAE automatically divides the data into eight classes. The corresponding octant numbers, from 1 to 8, are inserted into the last column of the original data table, as shown in <xref ref-type="fig" rid="fig-1">Figs. 1</xref> and <xref ref-type="fig" rid="fig-2">2</xref>. In this study, the SSAE rather than the AE is applied to data clustering. After data clustering, the user can upload the tables to Elasticsearch for distributed indexing.</p>
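<p>The mapping from a 3-dimensional encoder output to one of eight classes can be illustrated as follows. This is a sketch under stated assumptions: each axis is split at 0.5 (a natural midpoint for Sigmoid outputs, which lie between 0 and 1), and the numbering of the eight regions is hypothetical, since the text does not specify either choice.</p>
<code language="python"><![CDATA[
```python
def octant(point, center=(0.5, 0.5, 0.5)):
    """Map a 3-D encoder output to a class label from 1 to 8.

    Assumptions: each axis is split at `center`, and the label numbering
    (x contributes 1, y contributes 2, z contributes 4) is hypothetical.
    """
    x, y, z = point
    bits = (x >= center[0], y >= center[1], z >= center[2])
    return 1 + int(bits[0]) + 2 * int(bits[1]) + 4 * int(bits[2])


# Append the class label as the last column of each (hypothetical) data row.
rows = [["r1", 10.0], ["r2", 20.0]]
codes = [(0.1, 0.2, 0.3), (0.9, 0.1, 0.8)]
labelled = [row + [octant(code)] for row, code in zip(rows, codes)]
```
]]></code>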
</sec>
<sec id="s3_3"><label>3.3</label><title>Data Retrieval by Quick Indexing</title>
<p>Large-scale data retrieval generally incurs long I/O time. Distributed indexing with Elasticsearch can speed up data search and resolve the bottleneck of time-consuming single-machine processing: multiple servers in a cluster share the indexing work so that data search runs as fast as possible. When a node starts, it automatically joins the cluster, and one of the nodes is designated as the master node. Accordingly, as shown in <xref ref-type="fig" rid="fig-16">Fig. 16</xref>, Elasticsearch indexes the data and then stores the data in the distributed file system HDFS.</p>
<fig id="fig-16"><label>Figure 16</label><caption><title>Cluster by Elasticsearch</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-16.png"/></fig>
<p>After clustering the input data in three-dimensional space, the user can manually upload the data set to Elasticsearch for distributed indexing, as shown in <xref ref-type="fig" rid="fig-17">Fig. 17</xref>. When an input data set needs to be indexed, the indexing tool starts if the current node is the master node; otherwise, the request is forwarded to the master node. As shown in <xref ref-type="fig" rid="fig-18">Fig. 18</xref>, Elasticsearch first checks whether the data set was indexed before. If not, Elasticsearch starts building the index, writes the results into the Lucene index, and uses a hash algorithm to ensure that the indexed data is evenly distributed and stored in the designated primary shard and replica shard. Meanwhile, Elasticsearch creates a corresponding version number and stores it in the translog. If the data set was indexed earlier, Elasticsearch compares the existing version number with the new one to check for conflicts. If there is no conflict, indexing proceeds; otherwise, Elasticsearch returns an error, which is written to the translog. Finally, the indexed data stored in HDFS can be used by jobs issued as SQL commands, as shown in <xref ref-type="fig" rid="fig-13">Fig. 13</xref>.</p>
<fig id="fig-17"><label>Figure 17</label><caption><title>Data clustering and indexing flow</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-17.png"/></fig><fig id="fig-18"><label>Figure 18</label><caption><title>Indexing flow of Elasticsearch</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-18.png"/></fig>
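<p>The hash-based placement of indexed documents on primary shards can be illustrated as follows. The sketch takes the shard number as a hash of the document identifier modulo the number of primary shards. As assumptions for illustration only, CRC32 stands in for the Murmur3 hash that Elasticsearch applies to the routing value, and the document identifiers and the shard count of five are hypothetical.</p>
<code language="python"><![CDATA[
```python
import zlib
from collections import Counter


def pick_shard(doc_id, num_primary_shards):
    """Choose a primary shard from a hash of the document identifier."""
    # Stand-in hash: CRC32 replaces Murmur3 purely for illustration.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# Count how many of 10,000 hypothetical documents land on each of 5 shards.
counts = Counter(pick_shard(f"row-{i}", 5) for i in range(10000))
```
]]></code>
<p>Because the shard depends only on the identifier and the shard count, the same document is always routed to the same primary shard.</p>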
</sec>
<sec id="s3_4"><label>3.4</label><title>Deep Neural Network Prediction for Job Scheduling Optimization</title>
<p>To support the shortest-job-first (SJF) scheduling algorithm, users can train a deep neural network (DNN) to predict the approximate job execution time. The input layer of this DNN takes a six-dimensional vector consisting of the data size, the number of data rows, the number of data columns, the time complexity of program execution, the SQL interface environment, and the remaining memory size. The output layer produces a one-dimensional vector (Label), the predicted time of performing data retrieval, as shown in <xref ref-type="fig" rid="fig-19">Fig. 19</xref>. Users can collect real data from the web to train a DNN model. Three SQL interfaces, Hive, Impala, and SparkSQL, are used to test the DNN time prediction applied to the data retrieval function under different remaining memory sizes. The activation function of the DNN is ReLU, the loss function is MSE, and the optimizer is Adam [<xref ref-type="bibr" rid="ref-32">32</xref>]. The DNN model architecture and the loss curve during training are shown in <xref ref-type="fig" rid="fig-20">Figs. 20</xref> and <xref ref-type="fig" rid="fig-21">21</xref>, respectively.</p>
<fig id="fig-19"><label>Figure 19</label><caption><title>Column &#x201C;Label(s)&#x201D; indicating the output of a deep neural network (DNN)</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-19.png"/></fig><fig id="fig-20"><label>Figure 20</label><caption><title>Deep neural network (DNN) architecture</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-20.png"/></fig><fig id="fig-21"><label>Figure 21</label><caption><title>Loss curve during DNN training phase</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-21.png"/></fig>
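<p>The shape of such a predictor can be sketched in plain Python as a forward pass with six inputs, ReLU hidden layers, one linear output, and an MSE loss. This is not the architecture of Fig. 20: the hidden layer sizes, the random untrained weights, the feature scaling, and the numeric codes for the interface and the time complexity are all hypothetical, and training with Adam is omitted.</p>
<code language="python"><![CDATA[
```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    # weights[j] holds the incoming weights of output unit j
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(features, layers):
    hidden = features
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return dense(hidden, weights, biases)[0]  # linear output: time estimate


def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


rng = random.Random(0)
# Hypothetical layer sizes: 6 inputs, two hidden layers of 8 units, 1 output.
layers = [init_layer(6, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]
# data size, rows, columns, complexity code, interface code, free memory
sample = [1.2, 0.5, 0.12, 2.0, 1.0, 10.0]
estimate = predict(sample, layers)  # meaningless until the model is trained
```
]]></code>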
<p>In addition, we combine deep neural network prediction with shortest-job-first scheduling to optimize job scheduling. When a job enters the queue, the system first considers its execution priority and then predicts its approximate execution time, which serves as the job scheduling condition. Finally, the system considers the remaining memory size to select the appropriate SQL interface to carry out the input SQL command. In this way, the system can achieve higher throughput. <xref ref-type="fig" rid="fig-22">Fig. 22</xref> describes the flow of the proposed method.</p>
<fig id="fig-22"><label>Figure 22</label><caption><title>The flow of job scheduling optimization</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-22.png"/></fig>
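<p>The ordering rule, priority first and predicted execution time second, can be sketched with a priority queue. In this illustration a smaller priority number means a higher priority, in line with 5 being the lowest priority; the job names and predicted times are hypothetical, and ties are broken by arrival order as an assumption.</p>
<code language="python"><![CDATA[
```python
import heapq


def schedule(jobs):
    """Yield job names ordered by priority, then predicted time (SJF)."""
    heap = [(priority, predicted, arrival, name)
            for arrival, (name, priority, predicted) in enumerate(jobs)]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)[3]


# (name, priority, predicted execution time in seconds), in arrival order
jobs = [
    ("books", 5, 420.0),
    ("yield", 5, 35.0),
    ("flights", 1, 900.0),
    ("traffic", 5, 150.0),
]
order = list(schedule(jobs))
```
]]></code>
<p>Here the high-priority job runs first despite being the longest, and the remaining jobs follow in increasing order of predicted time.</p>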
</sec>
<sec id="s3_5"><label>3.5</label><title>Execution Commands and Their Flow</title>
<p>The programs designed in this study provide many application functions. Users issue commands, including SQL commands, through the command-line interface (CLI). <xref ref-type="table" rid="table-1">Table 1</xref> lists all commands and their functions, and <xref ref-type="fig" rid="fig-23">Fig. 23</xref> gives the execution flow.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Execution commands</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Command</th>
<th align="left">Function</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">sql [priority][sql_cmd]</td>
<td align="left">Enter an SQL query command with a priority; if the priority is not set, the default value is 5 (the lowest priority)</td>
</tr>
<tr>
<td align="left">status</td>
<td align="left">Displays the current cluster status</td>
</tr>
<tr>
<td align="left">flush</td>
<td align="left">Clear all in-memory cache</td>
</tr>
<tr>
<td align="left">purge [days]</td>
<td align="left">Clear cache data not accessed within the specified number of days</td>
</tr>
<tr>
<td align="left">display [on&#x007C;off]</td>
<td align="left">Whether to display search results</td>
</tr>
<tr>
<td align="left">enforced [name&#x007C;auto]</td>
<td align="left">Force the use of the specified analysis interface</td>
</tr>
<tr>
<td align="left">set [number]</td>
<td align="left">Set the input SQL query command number</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-23"><label>Figure 23</label><caption><title>Job execution flow</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-23.png"/></fig>
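<p>A parser for the commands in <xref ref-type="table" rid="table-1">Table 1</xref> might look like the following sketch. Only the command names and the default priority of 5 come from the table; the function name <monospace>parse_command</monospace>, the returned dictionary keys, and the error handling are hypothetical.</p>
<code language="python"><![CDATA[
```python
def parse_command(line):
    """Parse one CLI line into (command, arguments) following Table 1."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty command")
    name = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    if name == "sql":
        head, _, tail = rest.partition(" ")
        if head.isdigit():  # an explicit priority was given
            return name, {"priority": int(head), "sql_cmd": tail.strip()}
        return name, {"priority": 5, "sql_cmd": rest.strip()}  # default
    if name in ("status", "flush"):
        return name, {}
    if name == "purge":
        return name, {"days": int(rest)}
    if name == "display":
        if rest not in ("on", "off"):
            raise ValueError("display expects on or off")
        return name, {"enabled": rest == "on"}
    if name == "enforced":
        return name, {"interface": rest}
    if name == "set":
        return name, {"number": int(rest)}
    raise ValueError(f"unknown command: {name}")
```
]]></code>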
</sec>
</sec>
<sec id="s4"><label>4</label><title>Experiment Results and Discussion</title>
<sec id="s4_1"><label>4.1</label><title>Experimental Environment</title>
<p>The experiment uses the dynamic and adjustable resource characteristics of Proxmox VE to set up experimental environments with different memory sizes for the nodes in a cluster. The test environment is listed in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
<table-wrap id="table-2"><label>Table 2</label><caption><title>Test environment</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Test environment</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Environment I</td>
<td align="left">Configure the memory space of the cluster virtual machine to 5&#x00A0;GB</td>
</tr>
<tr>
<td align="left">Environment II</td>
<td align="left">Configure the memory space of the cluster virtual machine to 10&#x00A0;GB</td>
</tr>
<tr>
<td align="left">Environment III</td>
<td align="left">Configure the memory space of the cluster virtual machine to 20&#x00A0;GB</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2"><label>4.2</label><title>Data Sets</title>
<p>All experiments run jobs on real data sets collected from the web to verify that the proposed approaches can effectively improve business intelligence performance in big data analytics. The user inputs the SQL commands related to a given real data set and measures the job execution time consumed in a specific SQL interface. The real data sets collected from the web include (1) world-famous books [<xref ref-type="bibr" rid="ref-33">33</xref>], (2) production machine load [<xref ref-type="bibr" rid="ref-34">34</xref>], (3) semiconductor product yield [<xref ref-type="bibr" rid="ref-35">35</xref>], (4) temperature, rainfall, and electricity consumption related to people&#x2019;s livelihood [<xref ref-type="bibr" rid="ref-36">36</xref>,<xref ref-type="bibr" rid="ref-37">37</xref>], (5) forest flux station data [<xref ref-type="bibr" rid="ref-38">38</xref>], (6) traffic violations/accidents [<xref ref-type="bibr" rid="ref-39">39</xref>], (7) analysis of obesity factors [<xref ref-type="bibr" rid="ref-40">40</xref>], and (8) airport flight data [<xref ref-type="bibr" rid="ref-41">41</xref>]. The real data sets are described in detail as follows:</p>
<p>(1) World-famous books</p>
<p>This test first reads the full text of the following world-famous books: Alice&#x2019;s Adventures in Wonderland, The Art of War, Adventures of Huckleberry Finn, Sherlock Holmes, and The Adventures of Tom Sawyer. After that, the word count task counts the number of occurrences of each word and sorts the words from the most to the least frequent. An example plain text file of Alice&#x2019;s Adventures in Wonderland is shown in <xref ref-type="fig" rid="fig-24">Fig. 24</xref>.</p>
<fig id="fig-24"><label>Figure 24</label><caption><title>Plain text file of Alice&#x2019;s Adventures in Wonderland</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-24.png"/></fig>
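<p>The word count task can be sketched on a single machine as follows. This is an illustration only: the tokenization rule (lower-case letters and apostrophes) and the sample sentence are hypothetical, and the experiments run the task through the SQL interfaces rather than through this code.</p>
<code language="python"><![CDATA[
```python
import re
from collections import Counter


def word_count(text):
    """Count word occurrences and sort them from most to least frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


sample = "The cat and the hat and the bat"
top = word_count(sample)[:2]
```
]]></code>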
<p>(2) Production machine load</p>
<p><xref ref-type="fig" rid="fig-25">Fig. 25</xref> is a production process report provided by a large packaging and testing factory in Taiwan. The content includes the product number, the responsible employee, the name of the production machine, etc. The file is in .xls format. The purpose is to find the production machines used too frequently or too rarely in the production schedule and to support the decision-making of the person in charge of future schedules. The test counts the number of times each production machine is used and calculates the overall sample standard deviation. Following the concept of the normal distribution, data outside the &#x201C;mean &#x00B1; 2 &#x002A; standard deviation&#x201D; range are identified, and the corresponding machine is treated as possibly Overloading or Underloading.</p>
<fig id="fig-25"><label>Figure 25</label><caption><title>Records of production machine loading</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-25.png"/></fig>
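<p>The &#x201C;mean &#x00B1; 2 &#x002A; standard deviation&#x201D; rule can be sketched as follows, using the sample standard deviation. The machine names and usage counts are made-up illustrative values, not data from the report in <xref ref-type="fig" rid="fig-25">Fig. 25</xref>.</p>
<code language="python"><![CDATA[
```python
from statistics import mean, stdev


def flag_machines(usage):
    """Flag machines whose usage count lies outside mean +/- 2 * std."""
    counts = list(usage.values())
    mu, sigma = mean(counts), stdev(counts)  # stdev: sample standard deviation
    low, high = mu - 2 * sigma, mu + 2 * sigma
    flagged = {}
    for machine, count in usage.items():
        if count > high:
            flagged[machine] = "Overloading"
        elif count < low:
            flagged[machine] = "Underloading"
    return flagged


# Hypothetical usage counts for ten machines; M10 is used far more often.
usage = {"M1": 100, "M2": 102, "M3": 98, "M4": 101, "M5": 99,
         "M6": 100, "M7": 103, "M8": 97, "M9": 100, "M10": 400}
flagged = flag_machines(usage)
```
]]></code>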
<p>(3) Semiconductor product yield</p>
<p>As shown in <xref ref-type="fig" rid="fig-26">Fig. 26</xref>, the product test data provided by a major packaging and testing company in Taiwan includes various semiconductor test items and their PASS or FAIL results. The file is in standard .csv format (separated by commas). The purpose is to calculate the yield rate of the product to see whether it meets the company&#x2019;s yield rate standard (99.7&#x0025;).</p>
<fig id="fig-26"><label>Figure 26</label><caption><title>Records of semiconductor product</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-26.png"/></fig>
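<p>The yield check can be written in a few lines. In this sketch the yield rate is taken to be the fraction of PASS results, which is an assumption about the exact definition, and the list of results is invented for illustration.</p>
<code language="python"><![CDATA[
```python
def yield_rate(results):
    """Fraction of test results that are PASS."""
    passed = sum(1 for result in results if result == "PASS")
    return passed / len(results)


# Hypothetical batch: 998 passing and 2 failing test results.
results = ["PASS"] * 998 + ["FAIL"] * 2
rate = yield_rate(results)
meets_standard = rate >= 0.997  # company standard of 99.7%
```
]]></code>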
<p>(4) Temperature, rainfall, and electricity consumption related to livelihood</p>
<p>The website of the Taiwan meteorological bureau provides the rainfall and temperature data, as shown in <xref ref-type="fig" rid="fig-27">Fig. 27</xref>, and the website of TAIPOWER provides the livelihood electricity data, as shown in <xref ref-type="fig" rid="fig-28">Fig. 28</xref>. For both, the data collection period is from January 1, 2007 to April 30, 2020. The purpose is to find the correlation between rainfall, temperature, and electricity consumption in Taiwan. The correlation coefficient r between the data is calculated and reported as a positive, negative, or no correlation. A certain linear correlation exists between the two variables when 0 &#x003C; &#x007C;r&#x007C; &#x003C; 1. The closer &#x007C;r&#x007C; is to 1, the stronger the linear relationship between the two variables; conversely, the closer &#x007C;r&#x007C; is to 0, the weaker the linear relationship. Generally, the correlation can be divided into three levels: &#x007C;r&#x007C; &#x003C; 0.4 is a low-degree linear correlation; 0.4 &#x2264; &#x007C;r&#x007C; &#x003C; 0.7 is a significant correlation; 0.7 &#x2264; &#x007C;r&#x007C; &#x003C; 1 is a high-degree linear correlation.</p>
<fig id="fig-27"><label>Figure 27</label><caption><title>Records of temperature and rainfall</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-27.png"/></fig><fig id="fig-28"><label>Figure 28</label><caption><title>Records of livelihood electricity consumption</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-28.png"/></fig>
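<p>The correlation analysis can be sketched as follows, assuming that the coefficient in question is the Pearson correlation coefficient. The three levels follow the thresholds given above; the function names are hypothetical, and the sketch assumes that neither variable is constant.</p>
<code language="python"><![CDATA[
```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def level(r):
    """Classify |r| into the three levels used in the text."""
    magnitude = abs(r)
    if magnitude < 0.4:
        return "low"
    if magnitude < 0.7:
        return "significant"
    return "high"
```
]]></code>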
<p>(5) Forest flux station data</p>
<p>As shown in <xref ref-type="fig" rid="fig-29">Fig. 29</xref>, the EU Open Data Portal provides forest flux station data containing various flux information, including time and location, illuminance, soil information, and atmospheric flux. The file is in standard .csv format (separated by commas). The purpose is to calculate the correlation coefficients between the CO<sub>2</sub> flux coefficient and the light intensity, temperature, and humidity, respectively, to examine the degree of correlation.</p>
<fig id="fig-29"><label>Figure 29</label><caption><title>Records of forest flux station</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-29.png"/></fig>
<p>(6) Traffic violations/accidents</p>
<p><xref ref-type="fig" rid="fig-30">Fig. 30</xref> shows traffic violation/accident information recorded in the state of Maryland, USA. The data are in standard .csv format, i.e., items are separated by commas. The test calculates the frequency statistics of monthly traffic violations and accident locations.</p>
<fig id="fig-30"><label>Figure 30</label><caption><title>Records of traffic violation/accident</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-30.png"/></fig>
<p>(7) Analysis of obesity factors</p>
<p>The U.S. government has published information about obesity factors from 2011 to 2019, as shown in <xref ref-type="fig" rid="fig-31">Fig. 31</xref>. The data are in standard .csv format, i.e., items are separated by commas. The objective of this test is to analyze the relationship between age, weekly exercise status, vegetable and fruit intake, and BMI through data statistics, so that people can understand whether these factors affect being overweight.</p>
<fig id="fig-31"><label>Figure 31</label><caption><title>Records of obesity factor</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-31.png"/></fig>
<p>(8) Airport flight data</p>
<p><xref ref-type="fig" rid="fig-32">Fig. 32</xref> shows airport flight information recorded at New York airports in the USA. The data are in standard .csv format, i.e., items are separated by commas. This test calculates the proportion of flights to Taiwan among the total flights from these airports in the year.</p>
<fig id="fig-32"><label>Figure 32</label><caption><title>Records of airport flight information</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-32.png"/></fig>
</sec>
<sec id="s4_3"><label>4.3</label><title>Data Retrieval Experiment</title>
<p>This experiment tested each data set in the different experimental environments, executed the jobs according to the issued SQL commands, and compared the performance of different approaches. <xref ref-type="table" rid="table-3">Table 3</xref> lists the approaches applied in the experiments.</p>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Test method</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">FIFO-Hive</td>
<td align="left">Use the &#x2018;enforced hive&#x2019; command to switch to the hive interface and enter the SQL command [<xref ref-type="bibr" rid="ref-28">28</xref>]</td>
</tr>
<tr>
<td align="left">FIFO-Impala</td>
<td align="left">Use the &#x2018;enforced impala&#x2019; command to switch to the impala interface and enter the SQL command [<xref ref-type="bibr" rid="ref-28">28</xref>]</td>
</tr>
<tr>
<td align="left">FIFO-SparkSQL</td>
<td align="left">Use the &#x2018;enforced sparksql&#x2019; command to switch to the sparksql interface and enter the SQL command [<xref ref-type="bibr" rid="ref-28">28</xref>]</td>
</tr>
<tr>
<td align="left">MSHEFT</td>
<td align="left">According to the job data size for scheduling and memory space for interface selection [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
</tr>
<tr>
<td align="left">DAE-SOLR &#x002B; MSHEFT</td>
<td align="left">Pre-process the data with DAE-SOLR in advance and according to the job data size for scheduling and memory space for interface selection [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; DNNSJF</td>
<td align="left">Pre-process the data with SSAE-Elasticsearch in advance, and use DNN to predict job execution time for SJF scheduling and memory space for interface selection (proposed in this study)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In <xref ref-type="table" rid="table-3">Table 3</xref>, MSHEFT, DAE-SOLR &#x002B; MSHEFT, and SSAE-ES &#x002B; DNNSJF are different scheduling algorithms, so they execute jobs in different orders. In particular, when a job has a large amount of data to analyze, only the Hive interface can complete it, whereas the Impala and SparkSQL interfaces fail because the remaining memory size is insufficient. We noted that the Impala interface stands out when the remaining memory size is moderate. When the remaining memory size is large, the SparkSQL interface performs best due to its in-memory computing capability.</p>
<p>With automatic interface selection, the system selects the appropriate SQL interface and accepts the SQL command from a user to perform the corresponding job under the different scheduling algorithms: MSHEFT, DAE-SOLR &#x002B; MSHEFT, and SSAE-ES &#x002B; DNNSJF. If the data set stored in HDFS has undergone data retrieval optimization in advance, the optimized data clustering and indexing can significantly reduce the average execution time of data searching. <xref ref-type="fig" rid="fig-33">Figs. 33</xref>&#x2013;<xref ref-type="fig" rid="fig-35">35</xref> show the execution time of each job, and <xref ref-type="table" rid="table-4">Tables 4</xref>&#x2013;<xref ref-type="table" rid="table-6">6</xref> list the average job execution time and system throughput.</p>
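<p>The memory-based choice among the three interfaces can be sketched as a simple threshold rule. The thresholds of 8 GB and 16 GB are hypothetical values chosen only so that the 5 GB, 10 GB, and 20 GB environments of <xref ref-type="table" rid="table-2">Table 2</xref> map to Hive, Impala, and SparkSQL, respectively; the study does not state the actual thresholds, and the real selection may also depend on the job data size.</p>
<code language="python"><![CDATA[
```python
def select_interface(free_memory_gb, low=8.0, high=16.0):
    """Pick a SQL interface from the remaining memory size (in GB).

    The `low` and `high` thresholds are hypothetical illustration values.
    """
    if free_memory_gb < low:
        return "hive"      # works even when little memory remains
    if free_memory_gb < high:
        return "impala"    # stands out with a moderate amount of memory
    return "sparksql"      # in-memory computing benefits from large memory


choices = [select_interface(gb) for gb in (5, 10, 20)]
```
]]></code>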
<fig id="fig-33"><label>Figure 33</label><caption><title>Job execution time of various methods in Environment I</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-33.png"/></fig>
<fig id="fig-34"><label>Figure 34</label><caption><title>Job execution time of various methods in Environment II</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-34.png"/></fig>
<fig id="fig-35"><label>Figure 35</label><caption><title>Job execution time of various methods in Environment III</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-35.png"/></fig>
<table-wrap id="table-4"><label>Table 4</label><caption><title>Average job execution time of various methods in Environment I</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Average Execution Time (second)</th>
<th align="left">System Throughput<break/>(&#x0023; of jobs/hour)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">FIFO-Hive</td>
<td align="left">4097</td>
<td align="left">0.88</td>
</tr>
<tr>
<td align="left">FIFO-Impala</td>
<td align="left">&#x2212;</td>
<td align="left">&#x2212;</td>
</tr>
<tr>
<td align="left">FIFO-SparkSQL</td>
<td align="left">&#x2212;</td>
<td align="left">&#x2212;</td>
</tr>
<tr>
<td align="left">MSHEFT</td>
<td align="left">4070</td>
<td align="left">0.88</td>
</tr>
<tr>
<td align="left">DAE-SOLR &#x002B; MSHEFT</td>
<td align="left">806</td>
<td align="left">4.47</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; DNNSJF</td>
<td align="left">375</td>
<td align="left">9.60</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="tfn4_1"><p>Note: symbol &#x201C;&#x2212;&#x201D; stands for &#x201C;not available&#x201D;.</p></fn></table-wrap-foot></table-wrap>
<table-wrap id="table-5"><label>Table 5</label><caption><title>Average job execution time of various methods in Environment II</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Average Execution Time (second)</th>
<th align="left">System Throughput (&#x0023; of jobs/hour)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">FIFO-Hive</td>
<td align="left">3500</td>
<td align="left">1.03</td>
</tr>
<tr>
<td align="left">FIFO-Impala</td>
<td align="left">2987</td>
<td align="left">1.21</td>
</tr>
<tr>
<td align="left">FIFO-SparkSQL</td>
<td align="left">3143</td>
<td align="left">1.15</td>
</tr>
<tr>
<td align="left">MSHEFT</td>
<td align="left">2971</td>
<td align="left">1.21</td>
</tr>
<tr>
<td align="left">DAE-SOLR &#x002B; MSHEFT</td>
<td align="left">597</td>
<td align="left">6.03</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; DNNSJF</td>
<td align="left">332</td>
<td align="left">10.85</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-6"><label>Table 6</label><caption><title>Average job execution time of various methods in Environment III</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Average Execution Time (second)</th>
<th align="left">System Throughput<break/>(&#x0023; of jobs/hour)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">FIFO-Hive</td>
<td align="left">3126</td>
<td align="left">1.15</td>
</tr>
<tr>
<td align="left">FIFO-Impala</td>
<td align="left">2725</td>
<td align="left">1.32</td>
</tr>
<tr>
<td align="left">FIFO-SparkSQL</td>
<td align="left">2269</td>
<td align="left">1.59</td>
</tr>
<tr>
<td align="left">MSHEFT</td>
<td align="left">2250</td>
<td align="left">1.60</td>
</tr>
<tr>
<td align="left">DAE-SOLR &#x002B; MSHEFT</td>
<td align="left">507</td>
<td align="left">7.11</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; DNNSJF</td>
<td align="left">285</td>
<td align="left">12.63</td>
</tr>
</tbody>
</table>
</table-wrap>
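<p>The throughput values in <xref ref-type="table" rid="table-4">Tables 4</xref>&#x2013;<xref ref-type="table" rid="table-6">6</xref> appear consistent with dividing 3600 seconds by the average execution time. This relation is inferred from the reported numbers rather than stated in the text, and a few entries in Tables 5 and 6 differ in the last digit, presumably because the reported averages are rounded. The following sketch reproduces the Table 4 entries under that assumption.</p>
<code language="python"><![CDATA[
```python
def throughput_per_hour(avg_seconds):
    """Jobs per hour if each job takes `avg_seconds` on average."""
    return 3600.0 / avg_seconds


# Average execution times (seconds) reported in Table 4 (Environment I).
table4 = {
    "FIFO-Hive": 4097,
    "MSHEFT": 4070,
    "DAE-SOLR + MSHEFT": 806,
    "SSAE-ES + DNNSJF": 375,
}
rates = {method: round(throughput_per_hour(seconds), 2)
         for method, seconds in table4.items()}
# Ratio of average execution times: DAE-SOLR + MSHEFT vs. SSAE-ES + DNNSJF.
speedup = table4["DAE-SOLR + MSHEFT"] / table4["SSAE-ES + DNNSJF"]
```
]]></code>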
</sec>
<sec id="s4_4"><label>4.4</label><title>Job Scheduling Experiment</title>
<p>Since this study uses automatic SQL interface selection, the appropriate SQL interface is selected automatically to perform data analysis according to the available memory size. The experiments therefore use automatic selection in the different environments and compare the average job waiting time among different scheduling algorithms.</p>
<p>For the experimental tests hereafter, all data sets use SSAE and Elasticsearch for data retrieval together with the various job scheduling algorithms listed in <xref ref-type="table" rid="table-7">Table 7</xref>. We tested experimental Environments I, II, and III, as listed in <xref ref-type="table" rid="table-2">Table 2</xref>. Since FIFO, MSHEFT, and DNNSJF are different job scheduling algorithms, they execute the jobs in different orders.</p>
<table-wrap id="table-7"><label>Table 7</label><caption><title>Job scheduling algorithm</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SSAE-ES &#x002B; FIFO</td>
<td align="left">Use SSAE-Elasticsearch to preprocess the data in advance. The program will first calculate the remaining memory size of the system, automatically select the execution interface, and use the first-in-first-out algorithm for scheduling. (proposed in this study)</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; MSHEFT</td>
<td align="left">Use SSAE-Elasticsearch to preprocess the data in advance. The program will first calculate the remaining memory size of the system, automatically select the execution interface, and schedule according to the size of the data from small to large. (proposed in this study)</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; DNNSJF</td>
<td align="left">Use SSAE-Elasticsearch to preprocess the data in advance. The program will first calculate the remaining memory size of the system, automatically select the execution interface, and use DNN to predict the job execution time for SJF scheduling. (proposed in this study)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>According to the experimental results, different data analysis jobs take different amounts of time. Generally speaking, the job scheduling performance of MSHEFT is good. However, the MSHEFT algorithm produces the same schedule as the FIFO algorithm when different jobs have the same data set size. The experimental results show that the proposed DNNSJF algorithm obtains the shortest average waiting time and average weighted turnaround time for job execution, followed by the MSHEFT scheduling algorithm, which schedules the jobs based on data size, while the FIFO algorithm performs worst. <xref ref-type="fig" rid="fig-36">Figs. 36</xref>&#x2013;<xref ref-type="fig" rid="fig-38">38</xref> show the waiting time of each job, and <xref ref-type="table" rid="table-8">Tables 8</xref>&#x2013;<xref ref-type="table" rid="table-10">10</xref> list the average job waiting time and average weighted turnaround time.</p>
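<p>The two metrics can be illustrated with a small worked example. As assumptions, the weighted turnaround time of a job is taken to be its turnaround time divided by its execution time (the usual textbook definition, which the text does not spell out), all jobs arrive at time 0, and the three execution times are invented.</p>
<code language="python"><![CDATA[
```python
def metrics(exec_times):
    """Average waiting time and average weighted turnaround time.

    Jobs run back to back in the given order and all arrive at time 0.
    """
    clock, waits, weighted = 0.0, [], []
    for duration in exec_times:
        waits.append(clock)                  # time spent waiting in the queue
        clock += duration
        weighted.append(clock / duration)    # turnaround / execution time
    n = len(exec_times)
    return sum(waits) / n, sum(weighted) / n


arrival_order = [300.0, 60.0, 120.0]   # hypothetical execution times (s)
fifo = metrics(arrival_order)          # first-in-first-out order
sjf = metrics(sorted(arrival_order))   # shortest job first
```
]]></code>
<p>In this example the average waiting time drops from 220 s under FIFO to 80 s under SJF, and the average weighted turnaround time drops as well.</p>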

<fig id="fig-36"><label>Figure 36</label><caption><title>Job waiting time of various methods in Environment I</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-36.png"/></fig>
<fig id="fig-37"><label>Figure 37</label><caption><title>Job waiting time of various methods in Environment II</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-37.png"/></fig>
<fig id="fig-38"><label>Figure 38</label><caption><title>Job waiting time of various methods in Environment III</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20128-fig-38.png"/></fig>
<table-wrap id="table-8"><label>Table 8</label><caption><title>Average job waiting time and average weighted turnaround time of various methods in Environment I</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Average Waiting Time (s)</th>
<th align="left">Average Weighted Turnaround Time</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SSAE-ES &#x002B; FIFO</td>
<td align="left">1576</td>
<td align="left">9.51</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; DNNSJF</td>
<td align="left">1497</td>
<td align="left">9.43</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; MSHEFT</td>
<td align="left">1510</td>
<td align="left">11.17</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-9"><label>Table 9</label><caption><title>Average job waiting time and average weighted turnaround time of various methods in Environment II</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Average Waiting Time (s)</th>
<th align="left">Average Weighted Turnaround Time</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SSAE-ES &#x002B; FIFO</td>
<td align="left">1361</td>
<td align="left">9.78</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; DNNSJF</td>
<td align="left">1294</td>
<td align="left">9.70</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; MSHEFT</td>
<td align="left">1306</td>
<td align="left">11.58</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-10"><label>Table 10</label><caption><title>Average job waiting time and average weighted turnaround time of various methods in Environment III</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Average Waiting Time (s)</th>
<th align="left">Average Weighted Turnaround Time</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SSAE-ES &#x002B; FIFO</td>
<td align="left">1152</td>
<td align="left">10.27</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; DNNSJF</td>
<td align="left">1094</td>
<td align="left">10.18</td>
</tr>
<tr>
<td align="left">SSAE-ES &#x002B; MSHEFT</td>
<td align="left">1104</td>
<td align="left">12.54</td>
</tr>
</tbody>
</table>
</table-wrap>
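<p>As an illustration only, the following Python sketch shows how the two reported metrics can be computed for a given job order, and why ordering by data size degenerates to FIFO when all jobs have the same data size. The job names, data sizes, and execution times are hypothetical and are not taken from the experiments. The sketch assumes that all jobs are queued at time zero and run one at a time, and that weighted turnaround time is turnaround time divided by execution time.</p>
<preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    data_size_gb: float   # hypothetical input size
    run_time_s: float     # hypothetical actual execution time
    predicted_s: float    # hypothetical predicted execution time


def evaluate(order):
    """Return (average waiting time, average weighted turnaround time).

    Assumes every job is queued at time 0 and jobs run one at a time.
    """
    clock = 0.0
    waits, weighted = [], []
    for job in order:
        waits.append(clock)                      # time spent waiting in the queue
        clock += job.run_time_s                  # the job finishes at `clock`
        weighted.append(clock / job.run_time_s)  # turnaround / execution time
    n = len(order)
    return sum(waits) / n, sum(weighted) / n


# Hypothetical jobs that all read a data set of the same size.
jobs = [
    Job("join", 10.0, 300.0, 310.0),
    Job("count", 10.0, 60.0, 55.0),
    Job("group", 10.0, 120.0, 130.0),
]

fifo = list(jobs)                                          # arrival order
by_size = sorted(jobs, key=lambda j: j.data_size_gb)       # stable sort: ties keep arrival order
by_prediction = sorted(jobs, key=lambda j: j.predicted_s)  # shortest predicted job first

fifo_result = evaluate(fifo)
size_result = evaluate(by_size)
sjf_result = evaluate(by_prediction)
```
]]></preformat>
<p>Because the sort by data size is stable, equal sizes leave the arrival order unchanged, so the size-based schedule and the FIFO schedule coincide in this sketch, whereas ordering by predicted execution time still separates the jobs.</p>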
</sec>
<sec id="s4_5"><label>4.5</label><title>Hypothesis Testing</title>
<p>The data tested in this study cannot be assumed to follow a specific distribution, so a nonparametric statistical method is appropriate. This study therefore adopts the Wilcoxon signed-rank test [<xref ref-type="bibr" rid="ref-42">42</xref>], a nonparametric method applied here to paired observations, for hypothesis testing. As summarized in <xref ref-type="table" rid="table-11">Table 11</xref>, the first test compares the previous method DAE-SOLR &#x002B; MSHEFT with the proposed approach SSAE-ES &#x002B; DNNSJF on 30 valid samples drawn from Experiments I, II, and III of data retrieval. The second test compares the previous method SSAE-ES &#x002B; MSHEFT with the proposed approach SSAE-ES &#x002B; DNNSJF on nine valid samples drawn from Experiments I, II, and III of job scheduling.</p>
<table-wrap id="table-11"><label>Table 11</label><caption><title>Wilcoxon signed-rank test</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Variable/Statistic</th>
<th align="left">Cases in Data Retrieval</th>
<th align="left">Cases in Job Scheduling</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">n</td>
<td align="left">30</td>
<td align="left">9</td>
</tr>
<tr>
<td align="left">Sum of positive sequence</td>
<td align="left">9177</td>
<td align="left">348</td>
</tr>
<tr>
<td align="left">Sum of negative sequence</td>
<td align="left">0</td>
<td align="left">0</td>
</tr>
<tr>
<td align="left">T</td>
<td align="left">0</td>
<td align="left">0</td>
</tr>
<tr>
<td align="left">E(T)</td>
<td align="left">232.5</td>
<td align="left">22.5</td>
</tr>
<tr>
<td align="left">VAR(T)</td>
<td align="left">2363.75</td>
<td align="left">71.25</td>
</tr>
<tr>
<td align="left">Z<sub>&#x03B1;</sub> (&#x03B1; &#x003D; 0.05)</td>
<td align="left">1.65</td>
<td align="left">1.65</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>Z</mml:mi></mml:math></inline-formula></td>
<td align="left">4.78</td>
<td align="left">2.61&#x002A;</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="tfn11_1"><p>Notes: symbol &#x201C;&#x002A;&#x201D; represents the value of <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mrow><mml:msup><mml:mi>Z</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula>.</p></fn></table-wrap-foot></table-wrap>
<p>For a one-tailed test at the significance level &#x03B1; &#x003D; 0.05, the null hypothesis and the alternative hypothesis are as follows:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi>P</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi>C</mml:mi></mml:msub></mml:mrow><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>a</mml:mi></mml:msub></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi>P</mml:mi></mml:msub></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi>C</mml:mi></mml:msub></mml:mrow></mml:math></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>, <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi>P</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi>C</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> are the median time consumption of the previous method and of the proposed approach, respectively. T is defined as the smaller of the sum of the positive sequence and the sum of the negative sequence, as expressed in <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>T</mml:mi><mml:mo>=</mml:mo><mml:mi>min</mml:mi><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mtext>sum of positive sequence</mml:mtext><mml:mo>|</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mtext>sum of negative sequence</mml:mtext><mml:mo>|</mml:mo></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>E</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref> is the expected value of the random variable T, and <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>V</mml:mi><mml:mi>A</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref> is its variance:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>E</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mo>+</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>+</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>4</mml:mn></mml:math></disp-formula>
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>V</mml:mi><mml:mi>A</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>4</mml:mn></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msup><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>+</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>+</mml:mo><mml:mrow><mml:msup><mml:mi>n</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>24</mml:mn></mml:math></disp-formula></p>
<p>When the sample is sufficiently large (n &#x2265; 20), the test statistic is <italic>Z</italic> in <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref>. When the sample is small (n&#x2009;&#x003C;&#x2009;20), <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mrow><mml:msup><mml:mi>Z</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref> applies a continuity correction to the test statistic <italic>Z</italic>.
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mi>Z</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mpadded height="+0.7ex" depth="-0.7ex"><mml:mstyle displaystyle="false" scriptlevel="0"><mml:mrow><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>E</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mstyle></mml:mpadded><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>E</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msqrt><mml:mi>V</mml:mi><mml:mi>A</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msqrt></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mspace width="negativethinmathspace" /><mml:mpadded height="-0.7ex" depth="+0.7ex"><mml:mstyle displaystyle="false" scriptlevel="0"><mml:mrow><mml:mrow><mml:msqrt><mml:mi>V</mml:mi><mml:mi>A</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msqrt></mml:mrow></mml:mrow></mml:mstyle></mml:mpadded></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:msup><mml:mi>Z</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mpadded height="+0.7ex" depth="-0.7ex"><mml:mstyle displaystyle="false" scriptlevel="0"><mml:mrow><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>E</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:mrow></mml:mstyle></mml:mpadded><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>E</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow><mml:mrow><mml:msqrt><mml:mi>V</mml:mi><mml:mi>A</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msqrt></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mspace width="negativethinmathspace" /><mml:mpadded height="-0.7ex" depth="+0.7ex"><mml:mstyle displaystyle="false" scriptlevel="0"><mml:mrow><mml:mrow><mml:msqrt><mml:mi>V</mml:mi><mml:mi>A</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msqrt></mml:mrow></mml:mrow></mml:mstyle></mml:mpadded></mml:mrow></mml:math></disp-formula></p>
<p>From the standard normal table, the critical <italic>Z</italic> value at &#x03B1; &#x003D; 0.05 is 1.65, which is less than both 4.78 (<inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>Z</mml:mi></mml:math></inline-formula>&#x00A0;value) and 2.61 (<inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mrow><mml:msup><mml:mi>Z</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> value), as shown in <xref ref-type="table" rid="table-11">Table 11</xref>. The null hypothesis <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> is therefore rejected in both cases, which indicates that the results obtained with the proposed approaches in this paper are statistically significant under the Wilcoxon signed-rank test.</p>
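<p>The entries of <xref ref-type="table" rid="table-11">Table 11</xref> can be reproduced from n and T alone. The following Python sketch is a minimal illustration that uses the closed forms n(n&#x002B;1)/4 and n(n&#x002B;1)(2n&#x002B;1)/24 together with <xref ref-type="disp-formula" rid="eqn-8">Eqs. (8)</xref> and <xref ref-type="disp-formula" rid="eqn-9">(9)</xref>; the function and variable names are illustrative only and do not come from the experimental programs.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def wilcoxon_normal_approx(n, t, small_sample_threshold=20):
    """Normal approximation of the Wilcoxon signed-rank statistic.

    n: number of paired samples; t: the statistic T (smaller signed sum).
    Returns (E(T), VAR(T), z). The 1/2 continuity correction is applied
    when n is below the threshold, which gives Z* instead of Z.
    """
    expected = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24
    deviation = abs(t - expected)
    if n < small_sample_threshold:
        deviation -= 0.5
    return expected, variance, deviation / math.sqrt(variance)


retrieval = wilcoxon_normal_approx(n=30, t=0)   # data retrieval cases
scheduling = wilcoxon_normal_approx(n=9, t=0)   # job scheduling cases

critical = 1.65  # one-tailed critical value at alpha = 0.05, as in Table 11
reject_h0 = retrieval[2] > critical and scheduling[2] > critical
```
]]></preformat>
<p>With n &#x003D; 30 and n &#x003D; 9 and T &#x003D; 0 in both cases, the sketch yields the expected values, variances, and test statistics listed in the table after rounding to two decimals.</p>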

</sec>
<sec id="s4_6"><label>4.6</label><title>Discussion</title>
<p>According to the data retrieval experiments, the SSAE-ES approach successfully partitions the vast data into eight data zones, adds sparse constraints to enhance searching capability, and implements an indexing function for the large-scale database to realize fast indexing of target data. Compared with the previous method DAE-SOLR, the SSAE-ES approach improves the speed of data retrieval by 44&#x223C;53&#x0025; and increases system throughput by 44&#x223C;53&#x0025;. The proposed SSAE-ES approach thus outperforms the other alternatives in data retrieval in the experiments, and the Wilcoxon signed-rank test supports the claim that the improvement is significant. In addition, SparkSQL performs data retrieval better than Impala or Hive in the experiments when considerable memory is available.</p>
<p>Next, according to the job scheduling experiments, the MSHEFT algorithm schedules jobs based on the data size of each job, which decreases the waiting time of jobs in a queue. However, when different jobs have the same data size, MSHEFT schedules them exactly as FIFO would and loses its advantage. The proposed DNNSJF approach overcomes this problem because it can infer different execution times for different jobs with the same data size. Compared with the FIFO and MSHEFT job scheduling algorithms, the DNNSJF approach shortens the average job waiting time by 3&#x223C;5&#x0025; and 1&#x223C;3&#x0025;, respectively, and the average weighted turnaround time by 0.8&#x0025;&#x223C;0.9&#x0025; and 16&#x0025;&#x223C;19&#x0025;, respectively. These results show the strengths of the proposed DNNSJF approach, and the Wilcoxon signed-rank test confirms that the improvement is significant.</p>
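<p>The relative reductions can be recomputed from the averages reported in <xref ref-type="table" rid="table-8">Tables 8</xref>&#x2013;<xref ref-type="table" rid="table-10">10</xref>. The following Python sketch is offered only as a worked illustration of that arithmetic; it assumes that a reduction is measured relative to the baseline method, and the dictionary layout and function name are illustrative.</p>
<preformat preformat-type="code"><![CDATA[
```python
# (average waiting time in seconds, average weighted turnaround time),
# copied from Tables 8-10 for Environments I, II, and III.
results = {
    "I":   {"FIFO": (1576, 9.51),  "DNNSJF": (1497, 9.43),  "MSHEFT": (1510, 11.17)},
    "II":  {"FIFO": (1361, 9.78),  "DNNSJF": (1294, 9.70),  "MSHEFT": (1306, 11.58)},
    "III": {"FIFO": (1152, 10.27), "DNNSJF": (1094, 10.18), "MSHEFT": (1104, 12.54)},
}


def reduction_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Map (environment, baseline method) to the pair
# (waiting-time reduction, weighted-turnaround reduction) achieved by DNNSJF.
reductions = {}
for env, rows in results.items():
    wait, wtt = rows["DNNSJF"]
    for baseline in ("FIFO", "MSHEFT"):
        base_wait, base_wtt = rows[baseline]
        reductions[(env, baseline)] = (
            reduction_percent(base_wait, wait),
            reduction_percent(base_wtt, wtt),
        )

for (env, baseline), (d_wait, d_wtt) in reductions.items():
    print(f"Environment {env} vs {baseline}: "
          f"waiting time -{d_wait:.1f}%, weighted turnaround time -{d_wtt:.1f}%")
```
]]></preformat>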
<p>The experiments also found that the in-memory cache loses part of the retrieval data when a data block is larger than 100,000 bytes. Memcached may lose data when a large amount of retrieval data is written back to the in-memory cache in a single stroke. This is a weakness of using Memcached as the in-memory cache.</p>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Conclusion</title>
<p>This paper aims to improve the efficiency of big data analytics in two respects: speeding up data retrieval while increasing system throughput, and optimizing job scheduling to shorten the waiting time of jobs in a queue. There are two significant findings in this study. First, the advanced searching and indexing approaches improve the speed of data retrieval by up to 53&#x0025; and increase system throughput by up to 53&#x0025; compared with the previous method. Second, the deep learning model predicts the job execution time used to arrange prioritized job scheduling, shortening the average waiting time by up to 5&#x0025; and the average weighted turnaround time by up to 19&#x0025;. As a result, big data analytics and its application to business intelligence can achieve high performance and high efficiency with the approaches proposed in this study. However, the system is limited by its in-memory cache operations. When a single in-memory cache block holds a large amount of data (more than 100,000 bytes), part of the block is lost and the retrieval results are not written entirely into the in-memory cache. Moreover, the system can only write one stroke at a time to the in-memory cache sequentially, which is time-consuming. Both problems remain to be solved in future work.</p>
<p>Several aspects of the themes discussed in this paper are worth exploring further. The first is to find a model that predicts the job execution time with better precision; replacing the DNN model with another deep learning model, e.g., a long short-term memory network, is technically feasible. Second, a JDBC interface could be written to connect with the main program so that other SQL interfaces can be added. Finally, a better solution is needed for the limits of in-memory cache operations, namely the block size and the single sequential stroke writing, to avoid losing part of a block while writing.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="other"><p><bold>Author Contributions:</bold> B.R.C. and Y.C.L. conceived and designed the experiments; H.F.T. collected the experimental dataset, and H.F.T. proofread the paper; B.R.C. wrote the paper.</p></fn>
<fn fn-type="other"><p><bold>Funding Statement:</bold> This work was supported by grants from the Ministry of Science and Technology, Taiwan (MOST 110-2622-E-390-001 and MOST 109-2622-E-390-002-CC3).</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that there are no conflicts of interest regarding the publication of the paper.</p></fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>1.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guedea-Noriega</surname>, <given-names>H. H.</given-names></string-name>, <string-name><surname>Garc&#x00ED;a-S&#x00E1;nchez</surname>, <given-names>F.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Semantic (Big) Data analysis: An extensive literature review</article-title>. <source>IEEE Latin America Transactions</source><italic>,</italic> <volume>17</volume><italic>,</italic> <fpage>796</fpage>&#x2013;<lpage>806</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TLA.2019.8891948</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>2.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Gheorghe</surname>, <given-names>A. G.</given-names></string-name>, <string-name><surname>Crecana</surname>, <given-names>C. C.</given-names></string-name>, <string-name><surname>Negru</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Pop</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Dobre</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Decentralized storage system for edge computing</article-title>. <conf-name>2019 18th International Symposium on Parallel and Distributed Computing (ISPDC)</conf-name>, <conf-loc>Amsterdam, Netherlands</conf-loc>. DOI <pub-id pub-id-type="doi">10.1109/ISPDC.2019.00009</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>3.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lee</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Chung</surname>, <given-names>J. M.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Time estimation and resource minimization scheme for apache spark and hadoop Big Data systems with failures</article-title>. <source>IEEE Access</source><italic>,</italic> <volume>7</volume><italic>,</italic> <fpage>9658</fpage>&#x2013;<lpage>9666</lpage>. DOI <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2891001</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>4.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Deshpande</surname>, <given-names>P. M.</given-names></string-name>, <string-name><surname>Margoor</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Venkatesh</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Automatic tuning of SQL-on-Hadoop engines on cloud platforms</article-title>. <conf-name>2018 IEEE 11th International Conference on Cloud Computing (CLOUD)</conf-name>, <conf-loc>San Francisco, CA, USA</conf-loc>. DOI <pub-id pub-id-type="doi">10.1109/CLOUD.2018.00071</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>5.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hadjar</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Jedidi</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2019</year>). <article-title>A new approach for scheduling tasks and/or jobs in Big Data Cluster</article-title>. <conf-name>2019 4th MEC International Conference on Big Data and Smart City (ICBDSC)</conf-name>, <conf-loc>Muscat, Oman</conf-loc>.</mixed-citation></ref>
<ref id="ref-6"><label>6.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Lian</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Intelligent analysis of medical Big Data based on deep learning</article-title>. <source>IEEE Access</source><italic>,</italic> <volume>7</volume><italic>,</italic> <fpage>142022</fpage>&#x2013;<lpage>142037</lpage>. DOI <pub-id pub-id-type="doi">10.1109/ICBDSC.2019.8645613</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>7.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname>, <given-names>F. Y.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>J. J.</given-names></string-name>, <string-name><surname>Zheng</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Yuan</surname>, <given-names>Y.</given-names></string-name> <etal>et al.</etal> </person-group> (<year>2016</year>). <article-title>Where does AlphaGo go: From church-turing thesis to AlphaGo thesis and beyond</article-title>. <source>IEEE/CAA Journal of Automatica Sinica</source><italic>,</italic> <volume>3</volume><italic>,</italic> <fpage>113</fpage>&#x2013;<lpage>120</lpage>. DOI <pub-id pub-id-type="doi">10.1109/JAS.2016.7471613</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>8.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>F.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Survey on lie group machine learning</article-title>. <source>Big Data Mining and Analytics</source><italic>,</italic> <volume>3</volume><italic>,</italic> <fpage>235</fpage>&#x2013;<lpage>258</lpage>. DOI <pub-id pub-id-type="doi">10.26599/BDMA.2020.9020011</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>9.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Klinefelter</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Nanzer</surname>, <given-names>J. A.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Interferometric microwave radar with a feedforward neural network for vehicle speed-over-ground estimation</article-title>. <source>IEEE Microwave and Wireless Components Letters</source><italic>,</italic> <volume>30</volume><italic>,</italic> <fpage>304</fpage>&#x2013;<lpage>307</lpage>. DOI <pub-id pub-id-type="doi">10.1109/LMWC.2020.2966191</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>10.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ma</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Bao</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Yuan</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Zhao</surname>, <given-names>X.</given-names></string-name></person-group> (<year>2017</year>). <article-title>A Mongolian information retrieval system based on solr</article-title>. <conf-name>2017 9th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA)</conf-name>, <conf-loc>Changsha, China</conf-loc>. DOI <pub-id pub-id-type="doi">10.1109/ICMTMA.2017.0087</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>11.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yan</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Effective feature extraction via stacked sparse autoencoder to improve intrusion detection system</article-title>. <source>IEEE Access</source><italic>,</italic> <volume>6</volume><italic>,</italic> <fpage>41238</fpage>&#x2013;<lpage>41248</lpage>. DOI <pub-id pub-id-type="doi">10.1109/ACCESS.2018.2858277</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>12.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Brownlow</surname>, <given-names>B. N.</given-names></string-name>, <string-name><surname>Kanjamala</surname>, <given-names>P. P.</given-names></string-name>, <string-name><surname>Garcia Arredondo</surname>, <given-names>C. A.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2016</year>). <article-title>Real-time or near real-time persisting daily healthcare data into HDFS and ElasticSearch index inside a Big Data platform</article-title>. <source>IEEE Transactions on Industrial Informatics</source><italic>,</italic> <volume>13</volume><italic>,</italic> <fpage>595</fpage>&#x2013;<lpage>606</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TII.2016.2645606</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>13.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>Z.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Time series data for equipment reliability analysis with deep learning</article-title>. <source>IEEE Access</source><italic>,</italic> <volume>8</volume><italic>,</italic> <fpage>105484</fpage>&#x2013;<lpage>105493</lpage>. DOI <pub-id pub-id-type="doi">10.1109/ACCESS.2020.3000006</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>14.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Teraiya</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Shah</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Comparative study of LST and SJF scheduling algorithm in soft real-time system with its implementation and analysis</article-title>. <conf-name>2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI)</conf-name>, <conf-loc>Bangalore, India</conf-loc>. DOI <pub-id pub-id-type="doi">10.1109/ICACCI.2018.8554483</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>15.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Tian</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Ye</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Cloud resource scheduling with deep reinforcement learning and imitation learning</article-title>. <source>IEEE Internet of Things Journal</source><italic>,</italic> <volume>8</volume><issue>(5)</issue><italic>,</italic> <fpage>3576</fpage>&#x2013;<lpage>3586</lpage>. DOI <pub-id pub-id-type="doi">10.1109/JIOT.2020.3025015</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>16.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yeh</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>H.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Realizing prioritized scheduling service in the Hadoop system</article-title>. <conf-name>2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud)</conf-name>, <conf-loc>Barcelona, Spain</conf-loc>. DOI <pub-id pub-id-type="doi">10.1109/FiCloud.2018.00015</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>17.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Thangaselvi</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Ananthbabu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Jagadeesh</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Aruna</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2015</year>). <article-title>Improving the efficiency of MapReduce scheduling algorithm in Hadoop</article-title>. <conf-name>2015 International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT)</conf-name>, <conf-loc>Davangere, India</conf-loc>. DOI <pub-id pub-id-type="doi">10.1109/ICATCCT.2015.7456856</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>18.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sinaga</surname>, <given-names>K. P.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>M. S.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Unsupervised K-means clustering algorithm</article-title>. <source>IEEE Access</source><italic>,</italic> <volume>8</volume><italic>,</italic> <fpage>80716</fpage>&#x2013;<lpage>80727</lpage>. DOI <pub-id pub-id-type="doi">10.1109/ACCESS.2020.2988796</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>19.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Marquez</surname>, <given-names>E. S.</given-names></string-name>, <string-name><surname>Hare</surname>, <given-names>J. S.</given-names></string-name>, <string-name><surname>Niranjan</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Deep cascade learning</article-title>. <source>IEEE Transactions on Neural Networks and Learning Systems</source><italic>,</italic> <volume>29</volume><italic>,</italic> <fpage>5475</fpage>&#x2013;<lpage>5485</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TNNLS.2018.2805098</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>20.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Gupta</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Thakur</surname>, <given-names>H. K.</given-names></string-name>, <string-name><surname>Shrivastava</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Nag</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2017</year>). <article-title>A big data analysis framework using Apache Spark and deep learning</article-title>. <conf-name>2017 IEEE International Conference on Data Mining Workshops (ICDMW)</conf-name>, <conf-loc>New Orleans, LA, USA</conf-loc>. DOI <pub-id pub-id-type="doi">10.1109/ICDMW.2017.9</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>21.</label><mixed-citation publication-type="thesis"><person-group person-group-type="author"><string-name><surname>Lee</surname>, <given-names>Y. D.</given-names></string-name>, <string-name><surname>Chang</surname>, <given-names>B. R.</given-names></string-name></person-group> (<year>2018</year>). <source>Deep learning-based integration and optimization of rapid data retrieval in Big Data platforms</source> (<source>Master&#x2019;s Thesis</source>)<italic>,</italic> <publisher-name>Department of Computer Science and Information Engineering, National University of Kaohsiung</publisher-name>, <publisher-loc>Taiwan</publisher-loc>. DOI <pub-id pub-id-type="doi">10.1155/2021/9022558</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>22.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chang</surname>, <given-names>B. R.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>Y. D.</given-names></string-name>, <string-name><surname>Liao</surname>, <given-names>P. H.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Development of multiple Big Data analytics platforms with rapid response</article-title>. <source>Scientific Programming</source><italic>,</italic> <volume>2017</volume><italic>,</italic> <fpage>6972461</fpage>. DOI <pub-id pub-id-type="doi">10.1155/2017/6972461</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>23.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Abualigah</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Yousri</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Abd Elaziz</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Ewees</surname>, <given-names>A. A.</given-names></string-name>, <string-name><surname>Al-Qaness</surname>, <given-names>M. A.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Aquila optimizer: A novel meta-heuristic optimization algorithm</article-title>. <source>Computers &#x0026; Industrial Engineering</source><italic>,</italic> <volume>157</volume><italic>,</italic> <fpage>107250</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.cie.2021.107250</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>24.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Abualigah</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Abd Elaziz</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Sumari</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Geem</surname>, <given-names>Z. W.</given-names></string-name>, <string-name><surname>Gandomi</surname>, <given-names>A. H.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Reptile search algorithm (RSA): A nature-inspired meta-heuristic optimizer</article-title>. <source>Expert Systems with Applications</source><italic>,</italic> <volume>191</volume><italic>,</italic> <fpage>116158</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.eswa.2021.116158</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>25.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Abualigah</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Diabat</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Sumari</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Gandomi</surname>, <given-names>A. H.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Applications, deployments, and integration of Internet of Drones (IoD): A review</article-title>. <source>IEEE Sensors Journal</source><italic>,</italic> <volume>21</volume><issue>(22)</issue><italic>,</italic> <fpage>25532</fpage>&#x2013;<lpage>25546</lpage>. DOI <pub-id pub-id-type="doi">10.1109/JSEN.2021.3114266</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>26.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>W.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Supervised learning via unsupervised sparse autoencoder</article-title>. <source>IEEE Access</source><italic>,</italic> <volume>6</volume><italic>,</italic> <fpage>73802</fpage>&#x2013;<lpage>73814</lpage>. DOI <pub-id pub-id-type="doi">10.1109/ACCESS.2018.2884697</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>27.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Karacan</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Erdem</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Erdem</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Alpha matting with KL-divergence-based sparse sampling</article-title>. <source>IEEE Transactions on Image Processing</source><italic>,</italic> <volume>26</volume><italic>,</italic> <fpage>4523</fpage>&#x2013;<lpage>4536</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TIP.2017.2718664</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>28.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chang</surname>, <given-names>B. R.</given-names></string-name>, <string-name><surname>Tsai</surname>, <given-names>H. F.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>Y. D.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Integrated high-performance platform for fast query response in Big Data with Hive, Impala, and SparkSQL: A performance evaluation</article-title>. <source>Applied Sciences</source><italic>,</italic> <volume>8</volume><issue>(9)</issue><italic>,</italic> <fpage>1514</fpage>. DOI <pub-id pub-id-type="doi">10.3390/app8091514</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>29.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Topcuoglu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Hariri</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2002</year>). <article-title>Performance-effective and low-complexity task scheduling for heterogeneous computing</article-title>. <source>IEEE Transactions on Parallel and Distributed Systems</source><italic>,</italic> <volume>13</volume><issue>(3)</issue><italic>,</italic> <fpage>260</fpage>&#x2013;<lpage>274</lpage>. DOI <pub-id pub-id-type="doi">10.1109/71.993206</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>30.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Carra</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Michiardi</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2016</year>). <article-title>Memory partitioning and management in memcached</article-title>. <source>IEEE Transactions on Services Computing</source><italic>,</italic> <volume>12</volume><italic>,</italic> <fpage>564</fpage>&#x2013;<lpage>576</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TSC.2016.2613048</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>31.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ali</surname>, <given-names>A. M.</given-names></string-name>, <string-name><surname>Farhan</surname>, <given-names>A. K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>A novel improvement with an effective expansion to enhance the MD5 hash function for verification of a secure e-document</article-title>. <source>IEEE Access</source><italic>,</italic> <volume>8</volume><italic>,</italic> <fpage>80290</fpage>&#x2013;<lpage>80304</lpage>. DOI <pub-id pub-id-type="doi">10.1109/ACCESS.2020.2989050</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>32.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Verma</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Stoffov&#x00E1;</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Ill&#x00E9;s</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Tanwar</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Machine learning-based student&#x2019;s native place identification for real-time</article-title>. <source>IEEE Access</source><italic>,</italic> <volume>8</volume><italic>,</italic> <fpage>130840</fpage>&#x2013;<lpage>130854</lpage>. DOI <pub-id pub-id-type="doi">10.1109/ACCESS.2020.3008830</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>33.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lin</surname>, <given-names>Y. C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>World-famous books</article-title>. <uri xlink:href="https://github.com/did56789/World-famous-books.git">https://github.com/did56789/World-famous-books.git</uri>.</mixed-citation></ref>
<ref id="ref-34"><label>34.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lin</surname>, <given-names>Y. C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Production machine load data</article-title>. <uri xlink:href="https://github.com/did56789/Production-machine-load.git">https://github.com/did56789/Production-machine-load.git</uri>.</mixed-citation></ref>
<ref id="ref-35"><label>35.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lin</surname>, <given-names>Y. C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Semiconductor product yield data</article-title>. <uri xlink:href="https://github.com/did56789/Semiconductor-product-yield.git">https://github.com/did56789/Semiconductor-product-yield.git</uri>.</mixed-citation></ref>
<ref id="ref-36"><label>36.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><collab>MOTC, Central Weather Bureau</collab></person-group> (<year>2021</year>). <article-title>Rainfall and temperature data</article-title>. <uri xlink:href="https://www.cwb.gov.tw/V8/C/C/Statistics/monthlydata.html">https://www.cwb.gov.tw/V8/C/C/Statistics/monthlydata.html</uri>.</mixed-citation></ref>
<ref id="ref-37"><label>37.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><collab>Taiwan Power Company</collab></person-group> (<year>2021</year>). <article-title>Livelihood electricity data</article-title>. <uri xlink:href="https://www.taipower.com.tw/tc/page.aspx?mid=5554">https://www.taipower.com.tw/tc/page.aspx?mid=5554</uri>.</mixed-citation></ref>
<ref id="ref-38"><label>38.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><collab>EU Open Data Portal</collab></person-group> (<year>2021</year>). <article-title>The forest flux station data</article-title>. <uri xlink:href="https://data.europa.eu/data/datasets/jrc-abcis-it-sr2-2017?locale=en">https://data.europa.eu/data/datasets/jrc-abcis-it-sr2-2017?locale=en</uri>.</mixed-citation></ref>
<ref id="ref-39"><label>39.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lin</surname>, <given-names>Y. C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Traffic violations accidents data</article-title>. <uri xlink:href="https://github.com/did56789/Traffic-violations-accidents.git">https://github.com/did56789/Traffic-violations-accidents.git</uri>.</mixed-citation></ref>
<ref id="ref-40"><label>40.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><collab>Centers for Disease Control and Prevention</collab></person-group> (<year>2021</year>). <article-title>Nutrition, physical activity, and obesity-behavioral risk factor surveillance system</article-title>. <uri xlink:href="https://chronicdata.cdc.gov/Nutrition-Physical-Activity-and-Obesity/Nutrition-Physical-Activity-and-Obesity-Behavioral/hn4x-zwk7">https://chronicdata.cdc.gov/Nutrition-Physical-Activity-and-Obesity/Nutrition-Physical-Activity-and-Obesity-Behavioral/hn4x-zwk7</uri>.</mixed-citation></ref>
<ref id="ref-41"><label>41.</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><surname>Lin</surname>, <given-names>Y. C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Airport flight data</article-title>. <uri xlink:href="https://github.com/did56789/Airport-flight-data.git">https://github.com/did56789/Airport-flight-data.git</uri>.</mixed-citation></ref>
<ref id="ref-42"><label>42.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Hong</surname>, <given-names>W. C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Application of variational mode decomposition and chaotic grey wolf optimizer with support vector regression for forecasting electric loads</article-title>. <source>Knowledge-Based Systems</source><italic>,</italic> <volume>228</volume><italic>,</italic> <fpage>107297</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.knosys.2021.107297</pub-id>.</mixed-citation></ref>
</ref-list>
</back>
</article>
