<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMES</journal-id>
<journal-id journal-id-type="nlm-ta">CMES</journal-id>
<journal-id journal-id-type="publisher-id">CMES</journal-id>
<journal-title-group>
<journal-title>Computer Modeling in Engineering &#x0026; Sciences</journal-title>
</journal-title-group>
<issn pub-type="epub">1526-1506</issn>
<issn pub-type="ppub">1526-1492</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">20220</article-id>
<article-id pub-id-type="doi">10.32604/cmes.2022.020220</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Metal Corrosion Rate Prediction of Small Samples Using an Ensemble Technique</article-title>
<alt-title alt-title-type="left-running-head">Metal Corrosion Rate Prediction of Small Samples Using an Ensemble Technique</alt-title>
<alt-title alt-title-type="right-running-head">Metal Corrosion Rate Prediction of Small Samples Using an Ensemble Technique</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Yang</surname><given-names>Yang</given-names></name><xref ref-type="aff" rid="aff-1">1</xref>
<xref ref-type="aff" rid="aff-2">2</xref><email>swpu_yangy@126.com</email>
</contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Zheng</surname><given-names>Pengfei</given-names></name><xref ref-type="aff" rid="aff-3">3</xref>
<xref ref-type="aff" rid="aff-4">4</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Zeng</surname><given-names>Fanru</given-names></name><xref ref-type="aff" rid="aff-5">5</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Xin</surname><given-names>Peng</given-names></name><xref ref-type="aff" rid="aff-6">6</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>He</surname><given-names>Guoxi</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Liao</surname><given-names>Kexi</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<aff id="aff-1"><label>1</label><institution>State Key Laboratory of Oil Gas Reservoir Geology and Exploitation, Southwest Petroleum University</institution>, <addr-line>Chengdu, 610500</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Earth Sciences and Technology, Southwest Petroleum University</institution>, <addr-line>Chengdu, 610500</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>Spatial Information Technology and Big Data Mining Research Center, School of Earth Sciences and Technology, Southwest Petroleum University</institution>, <addr-line>Chengdu, 610500</addr-line>, <country>China</country></aff>
<aff id="aff-4"><label>4</label><institution>Sichuan Xinyang Anchuang Technology Co., Ltd.</institution>, <addr-line>Chengdu, 610500</addr-line>, <country>China</country></aff>
<aff id="aff-5"><label>5</label><institution>Sichuan Water Conservancy College</institution>, <addr-line>Chengdu, 610500</addr-line>, <country>China</country></aff>
<aff id="aff-6"><label>6</label><institution>CCDC Safety, Environment, Quality Supervision &#x0026; Testing Research Institute</institution>, <addr-line>Guanghan, 618300</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Yang Yang. Email: <email>swpu_yangy@126.com</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2022-08-11"><day>11</day>
<month>08</month>
<year>2022</year></pub-date>
<volume>134</volume>
<issue>1</issue>
<fpage>267</fpage>
<lpage>291</lpage>
<history>
<date date-type="received"><day>11</day><month>11</month><year>2021</year></date>
<date date-type="accepted"><day>24</day><month>2</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Yang et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Yang et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMES_20220.pdf"></self-uri>
<abstract>
<p>Accurate prediction of the internal corrosion rates of oil and gas pipelines could be an effective way to prevent pipeline leaks. In this study, a framework for predicting corrosion rates from a small sample of laboratory metal corrosion data was developed, providing a new perspective on solving the pipeline corrosion problem when real samples are insufficient. The approach employs the bagging algorithm to construct a strong learner by integrating several KNN learners. A total of 99 samples were collected and split into training and test sets at a ratio of 9:1. The training set was used to obtain the best hyperparameters by 10-fold cross-validation and grid search, and the test set was used to evaluate the performance of the model. The results show that the Mean Absolute Error (MAE) of this framework is 28.06&#x0025; of that of the traditional model, and that the framework outperforms other ensemble methods. Therefore, the proposed framework is suitable for metal corrosion prediction under small-sample conditions.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Oil pipeline</kwd>
<kwd>bagging</kwd>
<kwd>KNN</kwd>
<kwd>ensemble learning</kwd>
<kwd>small sample size</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>Owing to its economy and safety, pipeline transportation is one of the most important modes of oil and gas transportation [<xref ref-type="bibr" rid="ref-1">1</xref>]. However, the longer a pipeline operates, the greater the risk of corrosion [<xref ref-type="bibr" rid="ref-2">2</xref>]; corrosion is the main factor increasing the risk to oil and gas pipelines [<xref ref-type="bibr" rid="ref-3">3</xref>]. Due to the high cost of measuring equipment and the requirement for specific, regular calibration operations, corrosion defect measurements are not easy to obtain [<xref ref-type="bibr" rid="ref-4">4</xref>]. Thus, it is challenging to establish a high-precision corrosion prediction model. Despite this, many scholars have studied this problem and established different corrosion rate prediction models. Traditional (semi-empirical) models include the de Waard [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>], Cassandra (BP) [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-8">8</xref>], and Norsok [<xref ref-type="bibr" rid="ref-9">9</xref>] models. Of these, the de Waard model has been the most widely used since its establishment, although it rarely considers the influence of protective corrosion product films, especially at high temperatures and pH values. The Cassandra model, on the other hand, takes into account the influence of corrosion inhibitors and can achieve better prediction performance at high temperatures, but it does not consider the influence of medium flow rates and Cl<sup>&#x2212;</sup>. The Norsok model is a purely empirical model established using a large amount of data, and it takes into account a greater number of factors than the other models; however, as it does not consider the underlying mechanisms, it lacks universality and is relatively conservative in its predictions.</p>
<p>With the development of artificial intelligence (AI) technology, the use of deep learning methods to obtain better corrosion predictions for oil and gas pipelines has also become a focus of current research [<xref ref-type="bibr" rid="ref-10">10</xref>&#x2013;<xref ref-type="bibr" rid="ref-12">12</xref>]. For example, Jain et al. [<xref ref-type="bibr" rid="ref-13">13</xref>] proposed a quantitative evaluation model for the external corrosion rate of oil and gas pipelines based on Bayesian networks. Abbas et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] developed a neural network (NN) model to predict CO<sub>2</sub> corrosion in pipelines at high partial pressures. Ossai [<xref ref-type="bibr" rid="ref-15">15</xref>] developed a feedforward NN based on the particle swarm optimization (PSO) algorithm. Chen et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] proposed a Principal Component Analysis based Dynamic Fuzzy Neural Network (PCA-D-FNN) model. Seghier et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] outlined a new SVR-FFA framework for more accurately predicting the maximum depth of pitting corrosion in oil and gas pipelines. Although these models have achieved high accuracy to a certain extent, such methods are demanding with regard to the quality and quantity of pipeline inspection data, which are not always consistent due to the diversity of field conditions and the limitations of inspection technology. This complicates the process of building models based on deep learning techniques and makes the application and extension of these methods more difficult [<xref ref-type="bibr" rid="ref-17">17</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>]. Given the limitations of data acquisition from real-world pipeline systems, there is a need for high-quality lab-scale experimental data.</p>
<p>Experiments in a dynamic reactor [<xref ref-type="bibr" rid="ref-19">19</xref>] allow corrosion data to be obtained under laboratory conditions; however, such methods tend to be both expensive and time-consuming, and as such can only provide limited data. To overcome the problem of small datasets, some researchers have turned to deep learning methods to obtain reliable prediction models. For example, Zhu et al. and Paul et al. [<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-21">21</xref>] proposed models applicable to the prediction of image data. Chen et al. [<xref ref-type="bibr" rid="ref-22">22</xref>] proposed an ensemble long short-term memory (EnLSTM) model for time-series data. Although these models do improve prediction accuracy for small sample sizes, they cannot fully overcome the inherent disadvantages of NNs, namely overfitting and weak generalizability [<xref ref-type="bibr" rid="ref-23">23</xref>], which means that such models do not train reliably on small datasets [<xref ref-type="bibr" rid="ref-20">20</xref>]. Some studies have found that data mining techniques that build and combine individual learners to form a strong learner capable of better predictions (known as ensemble learning methods) play an important role in overcoming the overfitting problem [<xref ref-type="bibr" rid="ref-24">24</xref>] and can effectively handle datasets with high dimensionality, complex structures, and small sample sizes [<xref ref-type="bibr" rid="ref-25">25</xref>]. Such methods are even helpful in addressing unbalanced data distributions [<xref ref-type="bibr" rid="ref-26">26</xref>]; as a result, many researchers have directed their efforts toward ensemble learning. For instance, Dvornik et al. [<xref ref-type="bibr" rid="ref-27">27</xref>] demonstrated a significant reduction in variance by integrating distance-based classifiers in a small-sample setting. Mahdavi-Shahri et al. [<xref ref-type="bibr" rid="ref-24">24</xref>] proposed an ensemble learning method that achieved the best classification results for a multi-label problem with small samples. Guan et al. [<xref ref-type="bibr" rid="ref-28">28</xref>] used ensemble techniques to improve the accuracy of face recognition under small-sample conditions. The main ensemble learning methods, the bagging (bootstrap aggregation) [<xref ref-type="bibr" rid="ref-29">29</xref>] and boosting [<xref ref-type="bibr" rid="ref-30">30</xref>] algorithms, have been applied in many fields, such as predicting cloud meteorological data [<xref ref-type="bibr" rid="ref-31">31</xref>], concrete bearing pressures [<xref ref-type="bibr" rid="ref-32">32</xref>], and mapping ecological zones in aerial hyperspectral images [<xref ref-type="bibr" rid="ref-33">33</xref>]. Both bagging and KNN perform well on small-sample datasets, and scholars [<xref ref-type="bibr" rid="ref-34">34</xref>&#x2013;<xref ref-type="bibr" rid="ref-36">36</xref>] have studied models integrating the two. However, whether such an integrated model can predict small-sample datasets from the oil industry is still worth exploring. Against this backdrop, a framework for predicting corrosion rates from a small sample of laboratory metal corrosion data is proposed. The innovations and contributions are as follows:
<list list-type="bullet">
<list-item><p>Unlike other ensemble models, a bagging model with KNN base learners is proposed. This model performs better on the small-sample dataset of this experiment, clearly outperforming the traditional model.</p></list-item>
<list-item><p>The ensemble models of bagging, boosting, and stacking are all used for comparison. In this experiment, bagging is slightly superior to other ensemble models.</p></list-item>
<list-item><p>Various factors affecting the experimental results were studied.</p></list-item>
</list></p>
<p><xref ref-type="sec" rid="s2">Section 2</xref> describes the features of the data involved in the experiment and how the data were obtained. <xref ref-type="sec" rid="s3">Section 3</xref> introduces the experimental process and methods. In <xref ref-type="sec" rid="s4">Section 4</xref>, influencing factors such as the data dimensionality, split ratio, and discrete-value (noise) elimination are examined, the prediction errors of the ensemble models on this dataset are compared, and the advantages of the ensemble models are verified experimentally. <xref ref-type="sec" rid="s5">Section 5</xref> discusses the experimental results of <xref ref-type="sec" rid="s4">Section 4</xref> and summarizes the primary conclusions.</p>
</sec>
<sec id="s2"><label>2</label><title>Materials and Experimental Database</title>
<p>For the laboratory experiments in this study, a dynamic reactor apparatus was used (<xref ref-type="fig" rid="fig-1">Fig. 1</xref>), with a solution consisting of 3 L of water obtained from a shale gas gathering pipeline. The operating parameters inside the reactor, including the pressure, temperature, and CO<sub>2</sub> partial pressure, were controlled. In each group of experiments, four samples of the L360N pipeline steel (50&#x2009;&#x00D7;&#x2009;10&#x2009;&#x00D7;&#x2009;3 mm) were used to measure uniform corrosion. The experimental protocol was as follows: first, the reactor body and lid were sealed; then, the inlet and release valves on the lid were opened and nitrogen gas was passed through for two hours. Next, the release valve was closed and CO<sub>2</sub> and O<sub>2</sub> were injected. Finally, once the reactor pressure had been raised to 5&#x2005;MPa by further injection of N<sub>2</sub>, the inlet valve was also closed. The experimental period was 7 days, and the mass difference of the metal samples before and after the experiment was divided by the reaction time to obtain the average corrosion rate (weight loss tests [<xref ref-type="bibr" rid="ref-37">37</xref>]). The experiments were then repeated under different conditions [<xref ref-type="bibr" rid="ref-38">38</xref>,<xref ref-type="bibr" rid="ref-39">39</xref>]. Next, the multiphase flow simulation software package OLGA [<xref ref-type="bibr" rid="ref-40">40</xref>] was used to expand the experimental parameters, including the liquid flow rate, temperature, inclination angle, CO<sub>2</sub> partial pressure, and H<sub>2</sub>S partial pressure, in order to explore the influence of 18 other parameters, including the flow pattern, flow rate, and shear stress, on the corrosion rate. The name, abbreviation, unit, minimum/maximum value, and standard deviation (SD) of the resulting experimental data sets are shown in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>Schematic diagram of the dynamic reactor apparatus [<xref ref-type="bibr" rid="ref-38">38</xref>]</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-1.png"/></fig>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Parameters in the experimental data sets</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Parameter</th>
<th align="left">Code</th>
<th align="left">Unit</th>
<th align="left">Minimum</th>
<th align="left">Maximum</th>
<th align="left">SD</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Flow type&#x002A;</td>
<td align="left">a</td>
<td align="left">&#x2014;&#x2014;</td>
<td align="left">1</td>
<td align="left">3</td>
<td align="left">&#x2014;&#x2014;</td>
</tr>
<tr>
<td align="left">Flow</td>
<td align="left">b</td>
<td align="left">m<sup>3</sup>/d</td>
<td align="left">2.48</td>
<td align="left">15.6</td>
<td align="left">3.18</td>
</tr>
<tr>
<td align="left">Gas shear stress</td>
<td align="left">c</td>
<td align="left">bar</td>
<td align="left">1.94E &#x2212; 05</td>
<td align="left">4.11E &#x2212; 05</td>
<td align="left">9.59E &#x2212; 06</td>
</tr>
<tr>
<td align="left">Liquid shear stress</td>
<td align="left">d</td>
<td align="left">bar</td>
<td align="left">9.61E &#x2212; 10</td>
<td align="left">0.00012</td>
<td align="left">2.92E &#x2212; 05</td>
</tr>
<tr>
<td align="left">Gas flow rate</td>
<td align="left">e</td>
<td align="left">m/s</td>
<td align="left">1.21</td>
<td align="left">6.39</td>
<td align="left">1.3</td>
</tr>
<tr>
<td align="left">Liquid flow rate</td>
<td align="left">f</td>
<td align="left">m/s</td>
<td align="left">0.0025</td>
<td align="left">1.41</td>
<td align="left">0.36</td>
</tr>
<tr>
<td align="left">Superficial gas flow rate &#x002A;&#x002A;</td>
<td align="left">g</td>
<td align="left">m/s</td>
<td align="left">1.2</td>
<td align="left">6.14</td>
<td align="left">1.17</td>
</tr>
<tr>
<td align="left">Superficial flow rate of liquid&#x00A0;&#x002A;&#x002A;</td>
<td align="left">h</td>
<td align="left">m/s</td>
<td align="left">0.0009</td>
<td align="left">0.0036</td>
<td align="left">0.0007</td>
</tr>
<tr>
<td align="left">Solid phase deposition rate</td>
<td align="left">i</td>
<td align="left">kg/m<sup>3</sup>-s</td>
<td align="left">0</td>
<td align="left">0.0064</td>
<td align="left">0.0014</td>
</tr>
<tr>
<td align="left">Gas flow</td>
<td align="left">j</td>
<td align="left">m<sup>3</sup>/d</td>
<td align="left">3312.52</td>
<td align="left">26257.18</td>
<td align="left">6087.2</td>
</tr>
<tr>
<td align="left">Inclination</td>
<td align="left">k</td>
<td align="left">&#x00B0;</td>
<td align="left">1.15</td>
<td align="left">2.1</td>
<td align="left">0.17</td>
</tr>
<tr>
<td align="left">Liquid holding rate &#x002A;&#x002A;&#x002A;</td>
<td align="left">l</td>
<td align="left">&#x2014;&#x2014;</td>
<td align="left">0.005</td>
<td align="left">0.38</td>
<td align="left">0.09</td>
</tr>
<tr>
<td align="left">CO<sub>2</sub> partial pressure</td>
<td align="left">m</td>
<td align="left">bar</td>
<td align="left">5.21</td>
<td align="left">6.59</td>
<td align="left">0.32</td>
</tr>
<tr>
<td align="left">pH</td>
<td align="left">n</td>
<td align="left">&#x2014;&#x2014;</td>
<td align="left">4.64</td>
<td align="left">4.72</td>
<td align="left">0.019</td>
</tr>
<tr>
<td align="left">Temperature</td>
<td align="left">o</td>
<td align="left">&#x00B0;C</td>
<td align="left">62.67</td>
<td align="left">78.88</td>
<td align="left">3.66</td>
</tr>
<tr>
<td align="left">Gas density</td>
<td align="left">p</td>
<td align="left">kg/m<sup>3</sup></td>
<td align="left">55.96</td>
<td align="left">71.14</td>
<td align="left">3.66</td>
</tr>
<tr>
<td align="left">Liquid density</td>
<td align="left">q</td>
<td align="left">kg/m<sup>3</sup></td>
<td align="left">1001.28</td>
<td align="left">1002.35</td>
<td align="left">0.26</td>
</tr>
<tr>
<td align="left">Liquid surface tension &#x002A;&#x002A;&#x002A;&#x002A;</td>
<td align="left">r</td>
<td align="left">N/m</td>
<td align="left">0.0014</td>
<td align="left">0.006</td>
<td align="left">0.001</td>
</tr>
<tr>
<td align="left">Gas-phase thermal conductivity</td>
<td align="left">s</td>
<td align="left">W/(m&#x22C5;&#x00B0;C)</td>
<td align="left">0.038</td>
<td align="left">0.04</td>
<td align="left">0.00051</td>
</tr>
<tr>
<td align="left">Liquid-phase temperature</td>
<td align="left">t</td>
<td align="left">&#x00B0;C</td>
<td align="left">24.5</td>
<td align="left">29.9</td>
<td align="left">1.39</td>
</tr>
<tr>
<td align="left">Gas viscosity</td>
<td align="left">u</td>
<td align="left">cP</td>
<td align="left">0.014</td>
<td align="left">0.014</td>
<td align="left">0.00019</td>
</tr>
<tr>
<td align="left">Liquid viscosity</td>
<td align="left">v</td>
<td align="left">cP</td>
<td align="left">0.8</td>
<td align="left">0.9</td>
<td align="left">0.023</td>
</tr>
<tr>
<td align="left">H<sub>2</sub>S partial pressure</td>
<td align="left">w</td>
<td align="left">bar</td>
<td align="left">3.78</td>
<td align="left">4.76</td>
<td align="left">0.235</td>
</tr>
<tr>
<td align="left">Corrosion rate</td>
<td align="left">x</td>
<td align="left">mm/a</td>
<td align="left">0.089</td>
<td align="left">0.59</td>
<td align="left">0.12</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="tfn1_1"><p>Note: &#x002A; Flow type: OLGA classifies the flow into four types: stratified flow, annular flow, slug (segmented plug) flow, and bubble flow, denoted as 1, 2, 3, and 4, respectively, in this dataset. &#x002A;&#x002A; Superficial (apparent) flow rate refers to the virtual velocity a single fluid phase would have if it flowed alone (known as the superficial gas or liquid velocity depending on the type of fluid). &#x002A;&#x002A;&#x002A; Liquid holding rate, also known as the true liquid content rate or cross-sectional liquid content rate, refers to the proportion of the cross-sectional area occupied by the liquid phase relative to the total cross-sectional flow area during water-gas flow. &#x002A;&#x002A;&#x002A;&#x002A; Liquid surface tension: the force that acts on the surface of a liquid to reduce its surface area.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>To better visualize the correlation between the parameters and the corrosion rate, a Pearson correlation coefficient matrix was drawn (<xref ref-type="fig" rid="fig-2">Fig. 2</xref>). The graph shows the linear correlations between parameters, where positive and negative values represent positive and negative correlation, respectively. As shown in the figure, the five strongest correlations are between the solid-phase deposition rate and the superficial liquid flow rate (0.80), the gas flow rate and the liquid surface tension (0.78), the liquid density and the liquid viscosity (0.77), the CO<sub>2</sub> partial pressure and the gas viscosity (0.77), and the liquid-phase temperature and the liquid viscosity (0.77).</p>
<fig id="fig-2"><label>Figure 2</label><caption><title>The interaction indices were calculated by interaction detector (big value means strong interaction)</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-2.png"/></fig>
</sec>
<sec id="s3"><label>3</label><title>Establishment and Methods of the Framework</title>
<sec id="s3_1"><label>3.1</label><title>Bagging Algorithm</title>
<p>Ensemble learning is an important area of artificial intelligence and data mining that aims to build an integrated model by combining individual learners to improve overall performance [<xref ref-type="bibr" rid="ref-41">41</xref>]. In terms of integration approach, bagging and boosting (of which the adaptive boosting (AdaBoost) algorithm [<xref ref-type="bibr" rid="ref-42">42</xref>] is the most commonly implemented variant) are the two representative ensemble learning models. Bagging is one of the earliest ensemble learning algorithms; it uses a parallel integration strategy to randomly select different subsets of the training data. The same type of individual learner is then trained on each subset, and the final prediction is obtained by majority voting for classification problems and by simple averaging for regression problems [<xref ref-type="bibr" rid="ref-43">43</xref>,<xref ref-type="bibr" rid="ref-44">44</xref>]. The bagging algorithm can improve generalization by reducing the variance error [<xref ref-type="bibr" rid="ref-45">45</xref>]. The integration steps are as follows: suppose we have a training set
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo></mml:mrow><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo></mml:mrow><mml:mi>y</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mrow><mml:mtext>m,</mml:mtext></mml:mrow><mml:mi>y</mml:mi><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></disp-formula></p>
<p>The bagging algorithm first resamples the dataset with replacement to create a new training subset, returning each sampled item to the original dataset before the next draw. Although this approach may result in some training samples being selected multiple times while others are not selected at all, it does not harm the prediction performance of the model. After the individual learning algorithm is selected, each training subset is fed into the algorithm for training:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>D</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>Multiple individual learners are then combined to build a learner with stronger predictive ability; for the regression problem, the combination strategy is simply to average the prediction results of the weak learners:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>T</mml:mi></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>T</mml:mi></mml:munderover><mml:mrow><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-1">Eqs. (1)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-3">(3)</xref>, <italic>D</italic> is the training dataset, <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a sample in the training dataset, <italic>m</italic> is the total number of samples, <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is the feature vector of the input data, <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is the label value of the sample, <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is an individual learner, <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>D</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is the subset constructed after resampling, <italic>t</italic> indexes the resampling rounds, <italic>T</italic> is the number of individual learners, <italic>L</italic> is the individual learning algorithm, and <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the learner with stronger predictive ability after integration.</p>
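<p>As a concrete illustration, the three steps above map directly onto a short routine. The following is a minimal sketch of <xref ref-type="disp-formula" rid="eqn-1">Eqs. (1)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-3">(3)</xref> for the regression case, assuming scikit-learn-style individual learners; the function name and the default T&#x2009;&#x003D;&#x2009;10 are illustrative:</p>
<code language="python">import numpy as np
from sklearn.base import clone

def bagging_fit_predict(base_learner, X_train, y_train, X_test, T=10, seed=0):
    """Bootstrap-aggregate T copies of base_learner, Eqs. (1)-(3)."""
    rng = np.random.default_rng(seed)
    m = len(X_train)                      # m samples in D, Eq. (1); NumPy arrays assumed
    predictions = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)  # resample the subset D_t with replacement
        h_t = clone(base_learner).fit(X_train[idx], y_train[idx])  # h_t = L(D_t), Eq. (2)
        predictions.append(h_t.predict(X_test))
    return np.mean(predictions, axis=0)   # H(x): simple average of h_i(x), Eq. (3)</code>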
</sec>
<sec id="s3_2"><label>3.2</label><title>KNN Algorithm</title>
<p>Individual learners refer to algorithmic models with simple structures. Generally speaking, in the bagging integration strategy, the individual learners are called base learners, and in the boosting integration strategy, they are called component learners [<xref ref-type="bibr" rid="ref-44">44</xref>]. For convenience, we collectively call them individual learners.</p>
<p>The main idea behind the KNN algorithm is that the more similar things are, the more likely they are to be adjacent to each other: the prediction for a query point is obtained from the categories (or values) of its nearest neighboring data points. This algorithm was chosen because of its simplicity and because its distance-based sampling idea distinguishes it from the other algorithms [<xref ref-type="bibr" rid="ref-46">46</xref>]. Its advantages are that it is simple and insensitive to outliers; its disadvantages are its high time and space complexity and its limited interpretability. However, under small-sample conditions, the computational burden of the algorithm is greatly reduced. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> shows the forecasting principle of the KNN algorithm.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>Schematic diagram of KNN (K indicates the number of selected neighbors) [<xref ref-type="bibr" rid="ref-47">47</xref>]</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-3.png"/></fig>
</sec>
<sec id="s3_3"><label>3.3</label><title>Framework Building and Experimental Process</title>
<p>Because the models used for comparison share the same construction steps as the proposed framework, their construction is described together in this section; <xref ref-type="fig" rid="fig-4">Fig. 4</xref> illustrates the overall design flow of this study.</p>
<fig id="fig-4"><label>Figure 4</label><caption><title>Flow chart of the framework and experimental process</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-4.png"/></fig>
<p>To make the model comparison more convincing, we used K-nearest neighbors (KNN) [<xref ref-type="bibr" rid="ref-48">48</xref>], support vector machines (SVM) [<xref ref-type="bibr" rid="ref-49">49</xref>], and classification and regression trees (CART) [<xref ref-type="bibr" rid="ref-50">50</xref>] as individual learners for bagging and boosting integration; in addition, we chose four mature ensemble models, random forest (RF) [<xref ref-type="bibr" rid="ref-51">51</xref>], extra trees [<xref ref-type="bibr" rid="ref-52">52</xref>], gradient boosting [<xref ref-type="bibr" rid="ref-53">53</xref>], and the light gradient boosting machine (LightGBM) [<xref ref-type="bibr" rid="ref-54">54</xref>], for comparison. Based on these four models, our experiments first explored the influence of factors such as eliminating discrete values, reducing the dimensionality, and changing the data split ratio, which provided the basis for data preprocessing in the framework construction process.</p>
<p>Following the exploration of the influencing factors, data preprocessing began by eliminating discrete values from the experimental dataset according to the corrosion rate using a box-plot [<xref ref-type="bibr" rid="ref-55">55</xref>]. Because different divisions of the training and test sets can yield quite different results under small-sample conditions, three random seeds (222, 444, and 666) were used to shuffle the data before splitting, ensuring that multiple groups of experiments could be compared on the same dataset. The data were then split into training and test sets at a ratio of 9:1 [<xref ref-type="bibr" rid="ref-56">56</xref>], and after uniform normalization, Principal Component Analysis (PCA) [<xref ref-type="bibr" rid="ref-57">57</xref>] dimensionality reduction was performed to retain more than 90&#x0025; of the information.</p>
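<p>A minimal sketch of this preprocessing chain is given below; the file name is hypothetical, and min-max scaling is assumed for the normalization step:</p>
<code language="python">import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical cleaned dataset (discrete values already removed), columns as in Table 1
df = pd.read_csv("corrosion_data_clean.csv")
X = df.drop(columns=["x"]).to_numpy()   # input features (codes a-w)
y = df["x"].to_numpy()                  # corrosion rate (code x)

# Shuffle and split 9:1; 222 is one of the three shuffling seeds used
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=222)

# Normalize with statistics fitted on the training set only
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Keep enough principal components to retain 90% of the information
pca = PCA(n_components=0.90).fit(X_train)
X_train, X_test = pca.transform(X_train), pca.transform(X_test)</code>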
<p>During model building, a combination of 10-fold cross-validation [<xref ref-type="bibr" rid="ref-58">58</xref>] and grid search was used to determine the optimal hyperparameters of each model, with the search for each hyperparameter carried out within a certain range. It is worth noting that the base model and the integration method were searched separately in the grid and then integrated according to the selected optimal hyperparameters. First, the training data were divided evenly into 10 subsets, of which 9 were used for model training and the remaining one for validation. This operation was repeated 10 times, the average error was obtained, and all parameters within a certain range were then traversed by grid search. Finally, the parameter combination with the smallest 10-fold cross-validation error was obtained. The purpose of this process is to achieve the best predictive performance for each model and to minimize the influence of random seeds on model accuracy. The generalization ability of each model was then evaluated on the test set using the mean squared error (MSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) metrics, with MSE as the main evaluation index. Finally, the Friedman ranking [<xref ref-type="bibr" rid="ref-59">59</xref>] was used to compare the overall strengths and weaknesses of the models.</p>
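<p>The search itself can be expressed compactly with scikit-learn. The following is a minimal sketch for the bagging &#x002B; KNN combination, continuing from the preprocessing sketch above; the parameter ranges shown are illustrative rather than the ones used in the study:</p>
<code language="python">from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# 10-fold cross-validation combined with a grid search over both the base
# learner and the ensemble; 'estimator' is the scikit-learn 1.2 keyword
# (earlier versions call it 'base_estimator')
model = BaggingRegressor(estimator=KNeighborsRegressor(), random_state=222)
param_grid = {
    "estimator__n_neighbors": [2, 3, 5, 7],  # illustrative K values
    "n_estimators": [10, 50, 100, 200],      # illustrative ensemble sizes
}
search = GridSearchCV(model, param_grid, cv=10,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)  # smallest cross-validation MSE</code>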
<p>To verify the superiority of the framework, we also compared it with the traditional empirical models and with the more complex stacking ensemble model, and discussed whether integrating the base learners improves performance.</p>
</sec>
</sec>
<sec id="s4"><label>4</label><title>Results</title>
<p>This section presents a comparison of the results for the different ensemble learning methods under small-sample conditions, followed by an analysis of the effects of discrete values, PCA dimensionality reduction, and the data split ratio on the predictions. To verify the superiority of the ensemble approach, the prediction results of the ensemble learning models are compared with those of the weak learners and the traditional models. Note that, because of the small size of the dataset in this experiment, the computational efficiency of the models (e.g., run time) is not included in the comparison indices.</p>
<sec id="s4_1"><label>4.1</label><title>Ensemble Learning Model Prediction Performance Comparison Results</title>
<p>MSE, MAE, and MAPE were used to determine the error values of the ensemble algorithms on the test set. The differences between the predicted and true values of the bagging and boosting algorithms, in terms of MSE, are shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. The closer the results lie to the diagonal, the smaller the error between the two groups of data. The results show that when the random seed was 222 or 444, the extra trees, bagging &#x002B; KNN, and AdaBoost &#x002B; SVR models showed the minimum MSE values. However, when the random seed was 666, this changed to bagging &#x002B; CART, AdaBoost &#x002B; CART, and AdaBoost &#x002B; SVR. This indicates that the prediction performance of a model changes significantly depending on the training and test sets used.</p>
<fig id="fig-5"><label>Figure 5</label><caption><title>Comparison between model predictions and experimental values</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-5.png"/></fig>
<p>To evaluate the overall performance of each model on multiple datasets, a Friedman ranking plot was drawn (<xref ref-type="fig" rid="fig-6">Fig. 6</xref>). The blue dots in the graph indicate the average ranking of each algorithm; the farther to the left the value lies, the better the model. The results show that the prediction performance of the bagging &#x002B; KNN algorithm is the best, followed by extra trees and bagging &#x002B; CART, while the prediction performance of random forest is the lowest. The horizontal lines in the figure indicate the allowable fluctuation range of each algorithm's ranking. If the horizontal lines of a given pair of algorithms do not overlap, there is a significant difference between those algorithms. Here, however, the lines of all the algorithms overlap, indicating no significant differences between them. The order of the algorithms from best to worst is bagging &#x002B; KNN, extra trees, bagging &#x002B; CART, AdaBoost &#x002B; KNN, AdaBoost &#x002B; CART, AdaBoost &#x002B; SVR, bagging &#x002B; SVR, gradient boosting, LightGBM, and random forest.</p>
<fig id="fig-6"><label>Figure 6</label><caption><title>Friedman diagram for the different ensemble learning algorithms</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-6.png"/></fig>
</sec>
<sec id="s4_2"><label>4.2</label><title>Experimental Results for Exploring the Influencing Factors</title>
<sec id="s4_2_1"><label>4.2.1</label><title>Error Comparison Results after Eliminating Discrete Values</title>
<p>The box-plot uses the quartiles as boundaries for analyzing the distribution characteristics of the data, and the corrosion rate data were used as the basis for drawing this plot (<xref ref-type="fig" rid="fig-7">Fig. 7</xref>). The red dashed line in the scatter plot indicates the split line, and the red points represent the discrete points (defined as &#x003E;0.42) to be eliminated by the box-plot. The total number of discrete points determined using this method is 11.</p>
<fig id="fig-7"><label>Figure 7</label><caption><title>Scatter plot and edge box-plot of the corrosion rate</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-7.png"/></fig>
<p>Among the models used for comparison, the hyperparameters were uniformly set to n_estimators&#x2009;&#x003D;&#x2009;100 and random_state&#x2009;&#x003D;&#x2009;222, and the learning rates of the gradient boosting and LightGBM models were set to 0.1. The data before and after excluding the discrete values were then substituted into these models, and the 10-fold cross-validation results were recorded as shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>. The results show that after removing the outliers, the average prediction errors of the random forest, extra trees, gradient boosting, and LightGBM algorithms are reduced by 64.16&#x0025;, 68.50&#x0025;, 62.88&#x0025;, and 63.81&#x0025;, respectively. This shows that using a box-plot to eliminate discrete values can help improve the prediction performance of the models.</p>
<fig id="fig-8"><label>Figure 8</label><caption><title>10-fold cross-validation histograms before and after removing the discrete values</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-8.png"/></fig>
</sec>
<sec id="s4_2_2"><label>4.2.2</label><title>Error Comparison Results after Using PCA to Reduce the Dimensionality</title>
<p>In general, although a dataset needs features of multiple dimensions to be characterized as exhaustively as possible, having more features is not always advantageous, as some features may degrade the prediction results. In other words, some poorly correlated parameters can be removed [<xref ref-type="bibr" rid="ref-60">60</xref>]. For this dataset, the first principal component explains 55.68&#x0025; of the information, and retaining 95&#x0025;, 90&#x0025;, and 85&#x0025; of the information corresponds to keeping the first 14, 11, and 8 principal components, respectively.</p>
<p>In this part, only the dataset shuffled with seed 222 was used for splitting, and PCA was applied to produce three reduced datasets retaining 95&#x0025;, 90&#x0025;, and 85&#x0025; of the information. The random_state hyperparameter of the models was set to 222, the learning rates of gradient boosting and LightGBM were set to 0.1, and n_estimators was set to 19 moderately spaced values from 10 to 200. Finally, the evaluation values of the predicted results of these models were used as the basis for model comparison. <xref ref-type="fig" rid="fig-9">Fig. 9</xref> shows the calculation results, including the case without principal component reduction (100&#x0025;). <xref ref-type="table" rid="table-2">Table 2</xref> shows more details. The error value is significantly higher when the proportion of principal components is 95&#x0025; as opposed to 90&#x0025; or 85&#x0025;; further, the error values for the latter two cases are similar, except that the MSE value of the extra trees model with 90&#x0025; of the principal components (0.001545) was smaller than that with 85&#x0025; (0.002592). Therefore, although PCA is a commonly-used method for dimensionality reduction under small-sample conditions, there is no uniform standard for how much information should be retained in different scenarios, and 90&#x0025; was found to be the optimum in this study.</p>
<fig id="fig-9"><label>Figure 9</label><caption><title>Prediction error of each model after PCA dimensionality reduction</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-9.png"/></fig>
<table-wrap id="table-2"><label>Table 2</label><caption><title>MSE results for the different principal components</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead valign="top">
<tr>
<th align="left" rowspan="2">Algorithm</th>
<th align="left" rowspan="2">Principal components</th>
<th align="center" colspan="3">The MSE result of 10 random numbers</th>
</tr>
<tr>
<th align="left">Min</th>
<th align="left">Max</th>
<th align="left">Mean</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Random forest</td>
<td align="left">100&#x0025;</td>
<td align="left">0.0024</td>
<td align="left">0.0028</td>
<td align="left">0.0026</td>
</tr>
<tr>
<td align="left">Random forest</td>
<td align="left">95&#x0025;</td>
<td align="left">0.0024</td>
<td align="left">0.003</td>
<td align="left">0.0027</td>
</tr>
<tr>
<td align="left">Random forest</td>
<td align="left">90&#x0025;</td>
<td align="left">0.0015</td>
<td align="left">0.0023</td>
<td align="left">0.0017</td>
</tr>
<tr>
<td align="left">Random forest</td>
<td align="left">85&#x0025;</td>
<td align="left">0.0015</td>
<td align="left">0.003</td>
<td align="left">0.0017</td>
</tr>
<tr>
<td align="left">Gradient boosting</td>
<td align="left">100&#x0025;</td>
<td align="left">0.0023</td>
<td align="left">0.003</td>
<td align="left">0.0026</td>
</tr>
<tr>
<td align="left">Gradient boosting</td>
<td align="left">95&#x0025;</td>
<td align="left">0.0029</td>
<td align="left">0.0035</td>
<td align="left">0.0033</td>
</tr>
<tr>
<td align="left">Gradient boosting</td>
<td align="left">90&#x0025;</td>
<td align="left">0.0029</td>
<td align="left">0.003</td>
<td align="left">0.003</td>
</tr>
<tr>
<td align="left">Gradient boosting</td>
<td align="left">85&#x0025;</td>
<td align="left">0.0029</td>
<td align="left">0.003</td>
<td align="left">0.003</td>
</tr>
<tr>
<td align="left">Extra tree</td>
<td align="left">100&#x0025;</td>
<td align="left">0.0024</td>
<td align="left">0.003</td>
<td align="left">0.0027</td>
</tr>
<tr>
<td align="left">Extra tree</td>
<td align="left">95&#x0025;</td>
<td align="left">0.003</td>
<td align="left">0.0041</td>
<td align="left">0.003</td>
</tr>
<tr>
<td align="left">Extra tree</td>
<td align="left">90&#x0025;</td>
<td align="left">0.0012</td>
<td align="left">0.002</td>
<td align="left">0.0015</td>
</tr>
<tr>
<td align="left">Extra tree</td>
<td align="left">85&#x0025;</td>
<td align="left">0.0023</td>
<td align="left">0.0034</td>
<td align="left">0.0026</td>
</tr>
<tr>
<td align="left">LightGBM</td>
<td align="left">100&#x0025;</td>
<td align="left">0.0035</td>
<td align="left">0.0039</td>
<td align="left">0.0038</td>
</tr>
<tr>
<td align="left">LightGBM</td>
<td align="left">95&#x0025;</td>
<td align="left">0.0036</td>
<td align="left">0.0042</td>
<td align="left">0.004</td>
</tr>
<tr>
<td align="left">LightGBM</td>
<td align="left">90&#x0025;</td>
<td align="left">0.0025</td>
<td align="left">0.004</td>
<td align="left">0.0032</td>
</tr>
<tr>
<td align="left">LightGBM</td>
<td align="left">85&#x0025;</td>
<td align="left">0.0025</td>
<td align="left">0.0039</td>
<td align="left">0.0031</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2_3"><label>4.2.3</label><title>Error Comparison Results for Different Data Segmentation Ratios</title>
<p>Different models were used to compare the prediction errors for different split ratios. As shown in <xref ref-type="fig" rid="fig-10">Fig. 10</xref> and <xref ref-type="table" rid="table-3">Table 3</xref>, when the split ratio is 9:1, these models achieve the minimum average error: the average MSE of RF is 0.0017, gradient boosting 0.0026, extra trees 0.0024, and LightGBM 0.0025. With split ratios of 8:2 and 7:3, performance varies across the models.</p>
<fig id="fig-10"><label>Figure 10</label><caption><title>Prediction error of each model under different split ratios</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-10.png"/></fig>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Detailed parameter values of MSE results for different split ratios</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead valign="top">
<tr>
<th align="left">Algorithm</th>
<th align="left">Split ratio</th>
<th align="center" colspan="4">MSE results</th>
</tr>
<tr>
<td/>
<td/>
<th align="left">Random_state<break/>&#x2009;&#x003D;&#x2009;222</th>
<th align="left">Random_state<break/>&#x2009;&#x003D;&#x2009;444</th>
<th align="left">Random_state<break/>&#x2009;&#x003D;&#x2009;666</th>
<th align="left">Mean</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Random forest</td>
<td align="left">9:1</td>
<td align="left">0.0016</td>
<td align="left">0.0016</td>
<td align="left">0.0018</td>
<td align="left">0.0017</td>
</tr>
<tr>
<td align="left">Random forest</td>
<td align="left">8:2</td>
<td align="left">0.0021</td>
<td align="left">0.0037</td>
<td align="left">0.0027</td>
<td align="left">0.0028</td>
</tr>
<tr>
<td align="left">Random forest</td>
<td align="left">7:3</td>
<td align="left">0.0037</td>
<td align="left">0.0033</td>
<td align="left">0.0031</td>
<td align="left">0.0033</td>
</tr>
<tr>
<td align="left">Gradient boosting</td>
<td align="left">9:1</td>
<td align="left">0.0029</td>
<td align="left">0.0023</td>
<td align="left">0.0025</td>
<td align="left">0.0026</td>
</tr>
<tr>
<td align="left">Gradient boosting</td>
<td align="left">8:2</td>
<td align="left">0.0028</td>
<td align="left">0.0046</td>
<td align="left">0.0023</td>
<td align="left">0.0032</td>
</tr>
<tr>
<td align="left">Gradient boosting</td>
<td align="left">7:3</td>
<td align="left">0.0032</td>
<td align="left">0.0033</td>
<td align="left">0.0025</td>
<td align="left">0.0030</td>
</tr>
<tr>
<td align="left">Extra tree</td>
<td align="left">9:1</td>
<td align="left">0.0022</td>
<td align="left">0.002</td>
<td align="left">0.0029</td>
<td align="left">0.0024</td>
</tr>
<tr>
<td align="left">Extra tree</td>
<td align="left">8:2</td>
<td align="left">0.0024</td>
<td align="left">0.0034</td>
<td align="left">0.0031</td>
<td align="left">0.003</td>
</tr>
<tr>
<td align="left">Extra tree</td>
<td align="left">7:3</td>
<td align="left">0.0031</td>
<td align="left">0.0036</td>
<td align="left">0.0033</td>
<td align="left">0.0033</td>
</tr>
<tr>
<td align="left">LightGBM</td>
<td align="left">9:1</td>
<td align="left">0.0031</td>
<td align="left">0.0013</td>
<td align="left">0.0031</td>
<td align="left">0.0025</td>
</tr>
<tr>
<td align="left">LightGBM</td>
<td align="left">8:2</td>
<td align="left">0.0029</td>
<td align="left">0.0037</td>
<td align="left">0.0032</td>
<td align="left">0.0033</td>
</tr>
<tr>
<td align="left">LightGBM</td>
<td align="left">7:3</td>
<td align="left">0.0038</td>
<td align="left">0.003</td>
<td align="left">0.0029</td>
<td align="left">0.0032</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s4_3"><label>4.3</label><title>Comparison Results of Prediction Errors between Individual Learners and Ensemble Models</title>
<p>To investigate whether the ensemble learning models performed better than the individual learners, the results of integrating the individual learners KNN, SVR, and CART with the bagging and AdaBoost algorithms were compared in three randomized split experiments (<xref ref-type="fig" rid="fig-11">Fig. 11</xref>). The results show that the bagging algorithm reduces the MSE when the individual learner is KNN and the random seed is 222 or 444, while AdaBoost increases the error under all three random seeds. When the individual learner is SVR, AdaBoost significantly reduces the error except when the random seed is 666; in the other cases, bagging and AdaBoost yield error values similar to those of the individual learner. When the individual learner is CART and the random seed is 444 or 666, both bagging and AdaBoost reduce the prediction error, while the opposite is observed for a random seed of 222.</p>
<fig id="fig-11"><label>Figure 11</label><caption><title>Comparison of the MSE values of different algorithms</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-11.png"/></fig>
</sec>
<sec id="s4_4"><label>4.4</label><title>Comparison Results of Prediction Errors between Ensemble Methods and Traditional Models</title>
<p>A traditional model for oil and gas pipeline corrosion rate prediction, the de Waard model [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>], is given by the following equations:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>V</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>V</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>V</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:msub><mml:mi>log</mml:mi><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mi>V</mml:mi><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mn>4.93</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1119</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>273</mml:mn></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mn>0.58</mml:mn><mml:mrow><mml:msub><mml:mi>log</mml:mi><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mi>P</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mn>0.34</mml:mn><mml:mo stretchy="false">(</mml:mo><mml:mi>p</mml:mi><mml:mi>H</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>p</mml:mi><mml:mi>H</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>p</mml:mi><mml:mi>H</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn><mml:mo>=</mml:mo><mml:mn>3.82</mml:mn><mml:mo>+</mml:mo><mml:mn>0.00384</mml:mn><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>0.5</mml:mn><mml:mrow><mml:msub><mml:mi>log</mml:mi><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mi>P</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn></mml:math></disp-formula>
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>V</mml:mi><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>2.45</mml:mn><mml:mfrac><mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:msup><mml:mi>l</mml:mi><mml:mrow><mml:mn>0.8</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mn>0.2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mi>P</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn></mml:math></disp-formula></p>
<p>In contrast to the de Waard model, the Norsok M506 model [<xref ref-type="bibr" rid="ref-9">9</xref>] uses different equations over different temperature intervals. In the temperature range of 20&#x00B0;C&#x2013;150&#x00B0;C, the model is given by:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>V</mml:mi><mml:mtext>corr</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mtext>CO</mml:mtext><mml:mn>2</mml:mn></mml:mrow><mml:mn>0.62</mml:mn></mml:msubsup><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mi>w</mml:mi></mml:msub><mml:mn>19</mml:mn></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>0.146</mml:mn><mml:mo>+</mml:mo><mml:mn>0.0324</mml:mn><mml:msub><mml:mi>log</mml:mi><mml:mn>10</mml:mn></mml:msub><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mtext>CO</mml:mtext><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>pH</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>t</mml:mi></mml:msub></mml:math></disp-formula></p>
<p>Between 15&#x00B0;C and 20&#x00B0;C, the model is given by:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mi>V</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mi>K</mml:mi><mml:mi>t</mml:mi><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>0.36</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mi>&#x03C4;</mml:mi></mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>19</mml:mn></mml:mrow></mml:mfrac><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mn>0.146</mml:mn><mml:mo>+</mml:mo><mml:mn>0.0324</mml:mn><mml:mrow><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mi>f</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>p</mml:mi><mml:mi>H</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>t</mml:mi></mml:math></disp-formula></p>
<p>Between 5&#x00B0;C and 15&#x00B0;C, the model is given by:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mi>V</mml:mi><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mi>K</mml:mi><mml:mi>t</mml:mi><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>0.36</mml:mn></mml:mrow></mml:msubsup><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>p</mml:mi><mml:mi>H</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>t</mml:mi></mml:math></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-4">Eqs. (4)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-10">(10)</xref>, <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>V</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi></mml:math></inline-formula> is the corrosion rate in mm/a, <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>V</mml:mi><mml:mi>r</mml:mi></mml:math></inline-formula> is the reaction rate in mm/a, <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>V</mml:mi><mml:mi>m</mml:mi></mml:math></inline-formula> is the mass transfer rate in mm/a, <italic>t</italic> is the media temperature in &#x00B0;C, <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>P</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula> is the CO<sub>2</sub> partial pressure in MPa, <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>p</mml:mi><mml:mi>H</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula> is the actual pH, <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>p</mml:mi><mml:mi>H</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula> is the pH of the CO<sub>2</sub>-saturated solvent, <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>V</mml:mi><mml:mi>l</mml:mi></mml:math></inline-formula> is the liquid phase flow rate of the medium in m/s, <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>d</mml:mi></mml:math></inline-formula> is the pipe diameter in m, <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>f</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula> is the fugacity of CO<sub>2</sub>, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>K</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula> is a temperature-dependent constant, <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mrow><mml:mi>&#x03C4;</mml:mi></mml:mrow><mml:mi>w</mml:mi></mml:math></inline-formula> is the wall shear stress in Pa, and <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>p</mml:mi><mml:mi>H</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>t</mml:mi></mml:math></inline-formula> is the pH-dependence term.</p>
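<p>As an illustration, <xref ref-type="disp-formula" rid="eqn-4">Eqs. (4)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-7">(7)</xref> translate directly into code. The following Python sketch is ours (the function name and example inputs are illustrative, not taken from any reference implementation); the Norsok M506 model is not sketched here because its constants are tabulated in [<xref ref-type="bibr" rid="ref-9">9</xref>] rather than reproduced in this paper:</p>
<preformat>
import math

def de_waard_corrosion_rate(t, p_co2, ph_act, v_l, d):
    """Corrosion rate Vcorr in mm/a from Eqs. (4)-(7)."""
    # Eq. (6): pH of the CO2-saturated solvent
    ph_co2 = 3.82 + 0.00384 * t - 0.5 * math.log10(p_co2)
    # Eq. (5): reaction rate Vr in mm/a
    log10_vr = (4.93 - 1119.0 / (t + 273.0)
                + 0.58 * math.log10(p_co2) - 0.34 * (ph_act - ph_co2))
    v_r = 10.0 ** log10_vr
    # Eq. (7): mass transfer rate Vm in mm/a
    v_m = 2.45 * v_l ** 0.8 / d ** 0.2 * p_co2
    # Eq. (4): reaction- and transfer-limited rates combine harmonically
    return 1.0 / (1.0 / v_r + 1.0 / v_m)

# Illustrative inputs: 60 degC, 0.1 MPa CO2, pH 5.5, 2 m/s flow, 0.1 m pipe
print(de_waard_corrosion_rate(60.0, 0.1, 5.5, 2.0, 0.1))
</preformat>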
<p>Because the parameters of the de Waard model include the pipe diameter, whose effect cannot be measured under laboratory conditions, the Norsok M506 model was selected as the control model for this study; it is a purely empirical model fitted to a large body of experimental data. The data remaining after the exclusion of the discrete values were used to compare prediction results, with the MAE value as the performance metric. For the ensemble learning algorithms, the MAE values of each model were averaged over the three random seeds. <xref ref-type="fig" rid="fig-12">Fig. 12</xref> shows that the MAE of the bagging &#x002B; KNN framework recommended in this study is only 0.0346, while that of the traditional Norsok M506 model is 0.1322. This framework is also slightly better than the other ensemble models.</p>
<fig id="fig-12"><label>Figure 12</label><caption><title>Error comparison results between the traditional and ensemble models</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-12.png"/></fig>
</sec>
<sec id="s4_5"><label>4.5</label><title>Comparison Results of Prediction Errors between Ensemble Methods and Stacking Models</title>
<p>The Stacking [<xref ref-type="bibr" rid="ref-61">61</xref>] model allows any combination of different types of models. The integration strategy is as follows: K-fold cross-validation is first applied to the original data to obtain predictions from each model to be integrated; the average of these predictions is then used as the input of a simpler prediction model (this is done to prevent overfitting), and its output is the final result. The ensemble units of this model are usually bagging and boosting models. In this section, a total of 11 ensemble models (with the addition of the XGBoost [<xref ref-type="bibr" rid="ref-62">62</xref>] model) were used as base models to be integrated in the Stacking mode; to control the number of combinations, each model was used at most once per combination. <xref ref-type="fig" rid="fig-13">Fig. 13</xref> shows the prediction errors under different combinations. As the number of combined models increases, the prediction error values become more concentrated, and almost every segment shows a similar pattern of error variation. <xref ref-type="table" rid="table-4">Table 4</xref> lists the minimum MSE for each combination size and the proportion of combinations whose predictions improve relative to the models involved in the combination. When the random splitting seed is 666, the already large prediction errors of the base models are further amplified by integration, so the prediction performance of almost all models hardly improves under this data partition. Considering only the random splitting seeds 222 and 444, as the number of integrated models increases, the probability that the integrated prediction outperforms a single sub-model gradually decreases. These results show that even though some combinations in Stacking achieve better predictions, in terms of both prediction performance and the proportion of combinations that improve it, the Stacking integration mode is not well suited to small-sample prediction on this data.</p>
<fig id="fig-13"><label>Figure 13</label><caption><title>Combined prediction error plot in stacking model. a, b, c indicate the prediction results when the random splitting seed is 222, 444, 666, respectively; the horizontal coordinates of the graph indicate the number of models involved in the combination</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-13a.png"/><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_20220-fig-13b.png"/></fig>
<table-wrap id="table-4"><label>Table 4</label><caption><title>Comparison table of prediction performance of each combination in Stacking model</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead valign="top">
<tr>
<th align="left" rowspan="2">Number of<break/>combinations</th>
<th align="left" colspan="2">Random_state&#x2009;&#x003D;&#x2009;222</th>
<th align="left" colspan="2">Random_state&#x2009;&#x003D;&#x2009;444</th>
<th align="left" colspan="2">Random_state&#x2009;&#x003D;&#x2009;666</th>
<th align="left" rowspan="2">Average</th>
</tr>
<tr>
<th align="left">Minimum of MSE</th>
<th align="left">Percentage of result improvement</th>
<th align="left">Minimum of MSE</th>
<th align="left">Percentage of result improvement</th>
<th align="left">Minimum of MSE</th>
<th align="left">Percentage of result improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">2</td>
<td align="left">0.001262</td>
<td align="left">11/55</td>
<td align="left">0.001559</td>
<td align="left">23/55</td>
<td align="left">0.002294</td>
<td align="left">9/55</td>
<td align="left">0.31</td>
</tr>
<tr>
<td align="left">3</td>
<td align="left">0.001067</td>
<td align="left">30/165</td>
<td align="left">0.00146</td>
<td align="left">49/165</td>
<td align="left">0.002227</td>
<td align="left">7/165</td>
<td align="left">0.24</td>
</tr>
<tr>
<td align="left">4</td>
<td align="left">0.001022</td>
<td align="left">57/330</td>
<td align="left">0.001419</td>
<td align="left">76/330</td>
<td align="left">0.0022</td>
<td align="left">2/330</td>
<td align="left">0.20</td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">0.001016</td>
<td align="left">77/462</td>
<td align="left">0.001391</td>
<td align="left">71/462</td>
<td align="left">0.002218</td>
<td align="left">0/462</td>
<td align="left">0.16</td>
</tr>
<tr>
<td align="left">6</td>
<td align="left">0.001023</td>
<td align="left">71/462</td>
<td align="left">0.0014</td>
<td align="left">48/462</td>
<td align="left">0.002218</td>
<td align="left">0/462</td>
<td align="left">0.13</td>
</tr>
<tr>
<td align="left">7</td>
<td align="left">0.001045</td>
<td align="left">45/330</td>
<td align="left">0.001439</td>
<td align="left">18/330</td>
<td align="left">0.0027</td>
<td align="left">0/330</td>
<td align="left">0.10</td>
</tr>
<tr>
<td align="left">8</td>
<td align="left">0.001072</td>
<td align="left">18/165</td>
<td align="left">0.001538</td>
<td align="left">3/165</td>
<td align="left">0.002833</td>
<td align="left">0/165</td>
<td align="left">0.06</td>
</tr>
<tr>
<td align="left">9</td>
<td align="left">0.00112</td>
<td align="left">3/55</td>
<td align="left">0.001635</td>
<td align="left">0/55</td>
<td align="left">0.002973</td>
<td align="left">0/55</td>
<td align="left">0.03</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Discussion</title>
<sec id="s5_1"><label>5.1</label><title>Predictive Performance of the Framework</title>
<p>In this study, the ensemble learning algorithms with the smallest MSE were bagging&#x2002;&#x002B;&#x2002;KNN, extra trees, and bagging &#x002B; CART, all of which are based on the bagging framework. This indicates that the bagging algorithm is superior to the boosting algorithm for prediction on small data sets. This result is consistent with Wang et al. [<xref ref-type="bibr" rid="ref-44">44</xref>], who found that the bagging method was superior to the boosting method in classification and performed relatively better in the presence of noise. However, when using ensemble learning to analyze remote sensing images, both Chan et al. and DeFries et al. [<xref ref-type="bibr" rid="ref-63">63</xref>,<xref ref-type="bibr" rid="ref-64">64</xref>] found that the AdaBoost algorithm had higher accuracy while the bagging algorithm was more stable. The difference between those studies and the present one is that here the discrete values (i.e., noise) are not completely removed in some instances, which is more detrimental to boosting methods [<xref ref-type="bibr" rid="ref-65">65</xref>]. The different optimization strategies of the two algorithms also contributed to their different prediction performance: bagging uses a parallel strategy that averages out the errors caused by discrete values, while boosting uses a serial strategy that increases the weight of the discrete-value errors present in the training set during optimization. It is worth noting that the random forest algorithm has the lowest prediction rank among the models, mainly because of its large prediction error under the last random seed. The bagging &#x002B; KNN framework achieved the best prediction results, but under small-sample conditions the data themselves play a vital role in the results, so its performance on other data sets remains worth exploring.</p>
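<p>The contrast between the parallel and serial strategies can be demonstrated on synthetic data. The sketch below is illustrative only (it does not use the corrosion data set, and the relative ranking can vary with the data): it plants a few discrete values and compares the cross-validated MSE of the two frameworks, the point being that boosting&#x0027;s reweighting step tends to chase the planted noise:</p>
<preformat>
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(99, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=99)
y[:5] += 3.0  # a few discrete (noisy) values left in the data

for name, model in [
    ("bagging", BaggingRegressor(estimator=DecisionTreeRegressor(),
                                 random_state=0)),
    ("boosting", AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=3),
                                   random_state=0)),
]:
    # estimator= requires scikit-learn 1.2+; older versions use base_estimator=
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(name, round(float(mse), 4))
</preformat>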
</sec>
<sec id="s5_2"><label>5.2</label><title>Exploratory Analysis of Influencing Factors</title>
<p>In this experiment, the prediction error of each model decreased after the discrete values were eliminated. When the amount of data with certain special features is small, the distribution of these data relative to the rest forms small disjunct regions that may not be visible [<xref ref-type="bibr" rid="ref-66">66</xref>,<xref ref-type="bibr" rid="ref-67">67</xref>]. It is therefore not easy to determine whether the values in these regions are true values or noise [<xref ref-type="bibr" rid="ref-68">68</xref>], and their presence leaves the prediction bounds insufficiently supported, making it more difficult for the learning algorithm to generalize well [<xref ref-type="bibr" rid="ref-26">26</xref>]. Excluding these values therefore not only reduces the number of small disjunct regions but also benefits the model&#x0027;s generalization, which is inherently weak under small-sample conditions. Admittedly, the elimination of discrete values in this study has some defects. Rejecting discrete values based on the corrosion rate alone may leave some discrete values in place and may also remove non-discrete values. The culling was not based on the feature values because the data set has multiple dimensions, and culling on each feature separately could shrink the sample size and thereby increase the prediction errors. In the model applicability evaluation, when the random seed controlling the partition is 666, the prediction error increases significantly, likely because discrete values that were not removed are mixed into the test set. Therefore, when the study sample is small, more rigorous rules for identifying discrete values should be established. In addition, when building a prediction model on a small data set, the validation data should be kept as small as possible relative to the training set.</p>
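<p>The boxplot rule used for culling reduces to the interquartile-range criterion. A minimal sketch, assuming pandas and that only the corrosion-rate column is filtered (the column name and whisker factor are illustrative assumptions):</p>
<preformat>
import pandas as pd

def drop_discrete_values(df: pd.DataFrame, column: str = "corrosion_rate",
                         k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose target lies outside the boxplot whiskers."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1  # interquartile range
    # Keep only values within [Q1 - k*IQR, Q3 + k*IQR]
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]
</preformat>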
<p>The results show that dimensionality reduction helps improve the prediction accuracy of the model. In our experiment, the prediction was best when 90&#x0025; of the information was retained, while retaining other proportions produced results that fluctuated greatly. This shows that although PCA is a common dimensionality-reduction method for small-sample data [<xref ref-type="bibr" rid="ref-69">69</xref>], there is no uniform standard for how much information should be retained in different scenarios; the proportion should be selected according to the actual situation.</p>
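<p>In scikit-learn, the retained proportion of information can be passed to PCA directly as a variance fraction; a sketch of the 90&#x0025; setting (the standardization step is our assumption):</p>
<preformat>
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A float n_components keeps just enough principal components to
# explain that fraction of the variance, here 90%.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.90))
# Usage: X_reduced = reducer.fit_transform(X)
</preformat>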
<p>In the experiment on the influence of the split ratio on the prediction error, we found that the prediction error grows as the proportion of the training set decreases. With fewer training samples, overfitting is more likely to occur, which degrades the generalization ability of the model. This is consistent with current mainstream research [<xref ref-type="bibr" rid="ref-67">67</xref>]: whether or not samples are scarce, increasing the amount of training data always helps the prediction results.</p>
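<p>The split-ratio effect can be probed by sweeping the training proportion; an illustrative sketch on synthetic data (the learner and the set of ratios are assumptions, not the exact experimental protocol):</p>
<preformat>
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(size=(99, 5))
y = X.sum(axis=1) + rng.normal(scale=0.05, size=99)

# Shrinking the training proportion generally raises the test error.
for train_ratio in (0.9, 0.7, 0.5, 0.3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_ratio, random_state=222)
    model = KNeighborsRegressor(n_neighbors=3).fit(X_tr, y_tr)
    print(train_ratio, round(mean_squared_error(y_te, model.predict(X_te)), 4))
</preformat>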
</sec>
<sec id="s5_3"><label>5.3</label><title>Comparison with Other Models</title>
<p>Although many researchers believe that ensemble learning is superior to individual learners and helps solve small-sample problems [<xref ref-type="bibr" rid="ref-27">27</xref>], the results in <xref ref-type="fig" rid="fig-11">Fig. 11</xref> show that this is not necessarily the case. Windeatt [<xref ref-type="bibr" rid="ref-70">70</xref>] pointed out that a good ensemble must satisfy both accuracy and diversity conditions. Therefore, when the dataset is small and contains some discrete values, ensemble learning methods should not be used blindly. In such cases, if the bagging algorithm cannot achieve good prediction results, individual learners are worth trying.</p>
<p><xref ref-type="fig" rid="fig-12">Fig. 12</xref> shows that the predicted MAE value of the Norsok M506 model was 0.1322, which is significantly larger than that of the ensemble learning algorithm. This indicates that ensemble learning still performs better than traditional empirical models even under small sample conditions. In this case, the results from the Norsok M506 model are relatively conservative compared to other traditional models, which results in larger predicted corrosion rates and thus higher prediction errors [<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
<p>The results in <xref ref-type="fig" rid="fig-13">Fig. 13</xref> show that the spread of the prediction error narrows as the number of combined models increases in the Stacking combination model. This means that when integrating with the Stacking model, more model combinations are not better; a smaller number of combined models is more likely to reduce the combined prediction error. In most cases in this part of the experiment, the prediction performance after stacking did not outperform that of the underlying ensemble models, indicating that the Stacking algorithm does not further improve the predictive power of the model. As stated by other researchers [<xref ref-type="bibr" rid="ref-71">71</xref>], improving prediction performance with Stacking requires that the models involved in the integration be both diverse and well performing, conditions that cannot be met under small-sample conditions. The results in this section show that although the prediction errors of the bagging &#x002B; KNN framework are still large under small-sample conditions, it is better suited to small-sample prediction than the traditional model and the more complex Stacking model.</p>
</sec>
</sec>
<sec id="s6"><label>6</label><title>Conclusions</title>
<p>In this paper, a bagging ensemble framework with KNN base learners is proposed to predict metal corrosion rates under small-sample laboratory conditions in the absence of real oil and gas pipeline data. The two most important preprocessing steps are PCA dimensionality reduction and boxplot-based removal of discrete values; both aim to optimize the data set and reduce the influence of extraneous factors and noise on the experiment.</p>
<p>The 99 samples were split into training and test sets at a ratio of 9:1, and the other models were compared and analyzed on this data set. The results show that while the MAE value of the Norsok model is 0.1322, the error of the bagging &#x002B; KNN framework is only 0.0346, slightly smaller than that of the other ensemble models, indicating that this framework has clear advantages in this scenario. The Stacking mode is more complicated but its prediction performance is not ideal, so Stacking is not recommended under small-sample conditions. In addition, the individual learners in this experiment did not perform as poorly as might be expected; when computing resources are limited and the predictions of the ensemble models are unsatisfactory, individual learners are worth trying.</p>
<p>The effects of various factors on the experimental results were discussed. The results show that removing discrete values with boxplots, reducing the dimensionality moderately (in this experiment, retaining 90&#x0025; of the information gave the best result), and increasing the number of training samples can all improve the prediction performance to some extent. In addition, we found that the error of the bagging mode is often smaller than that of the boosting mode in the small-sample scenario of this experiment, indicating that bagging has a greater advantage in this scenario.</p>
<p>Although the proposed framework performs better on this data set than the other models, its error on the test set is still large, so its generalization ability remains open to discussion. Under small-sample conditions, slight changes in the data may greatly disturb the results, so this framework only provides one way to study the corrosion of oil and gas pipelines with small samples. It is worth noting that the framework removes discrete values and thus narrows the fluctuation range of the data set, so its generalization ability will be limited when it is applied to new data sets. For future researchers, it is therefore important to explore in more detail the effect of removing outliers from small samples and how to improve the generalization ability of the model.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="other"><p><bold>Availability of Data and Materials:</bold> The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.</p></fn>
<fn fn-type="other"><p><bold>Funding Statement:</bold> This work was supported by the National Natural Science Foundation of China (Grant No. 52174062).</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> It should be understood that none of the authors have any financial or scientific conflicts of interest concerning the research described in this manuscript.</p></fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>1.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Totlani</surname>, <given-names>M. K.</given-names></string-name>, <string-name><surname>Athavale</surname>, <given-names>S. N.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Electroless nickel for corrosion control in chemical, oil and gas industries</article-title>. <source>Corrosion Reviews</source><italic>,</italic> <volume>18</volume><issue>(2&#x2013;3)</issue><italic>,</italic> <fpage>155</fpage>&#x2013;<lpage>179</lpage>. DOI <pub-id pub-id-type="doi">10.1515/CORRREV.2000.18.2-3.155</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>2.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname>, <given-names>Z. D.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Shao</surname>, <given-names>L. W.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Damage identification of pipeline based on ultrasonic guided wave and wavelet denoising</article-title>. <source>Journal of Pipeline Systems Engineering &#x0026; Practice</source><italic>,</italic> <volume>12</volume><issue>(4)</issue><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>14</lpage>. DOI <pub-id pub-id-type="doi">10.1061/(ASCE)PS.1949-1204.0000600</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>3.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guillal</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Seghier</surname>, <given-names>M. E. A. B.</given-names></string-name>, <string-name><surname>Nourddine</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Correia</surname>, <given-names>J. A. F. O.</given-names></string-name>, <string-name><surname>Mustaffa</surname>, <given-names>Z. B.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>Probabilistic investigation on the reliability assessment of mid-and high-strength pipelines under corrosion and fracture conditions</article-title>. <source>Engineering Failure Analysis</source><italic>,</italic> <volume>118</volume><italic>,</italic> <fpage>104891</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.engfailanal.2020.104891</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>4.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Seghier</surname>, <given-names>M. E. A. B.</given-names></string-name>, <string-name><surname>Keshtegar</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Taleb-Berrouane</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Abbassi</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Trung</surname>, <given-names>N. T.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Advanced intelligence frameworks for predicting maximum pitting corrosion depth in oil and gas pipelines</article-title>. <source>Process Safety and Environmental Protection</source><italic>,</italic> <volume>147</volume><italic>,</italic> <fpage>818</fpage>&#x2013;<lpage>833</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.psep.2021.01.008</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>5.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Pots</surname>, <given-names>B. F. M.</given-names></string-name>, <string-name><surname>John</surname>, <given-names>R. C.</given-names></string-name>, <string-name><surname>Rippon</surname>, <given-names>I. J.</given-names></string-name>, <string-name><surname>Thomas</surname>, <given-names>M. J. J. S.</given-names></string-name>, <string-name><surname>Kapusta</surname>, <given-names>S. D.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2002</year>). <article-title>Improvements on de waard-milliams corrosion prediction and applications to corrosion management</article-title>. <conf-name>NACE-International Corrosion Conference Series</conf-name>, <conf-loc>Denver, Colorado</conf-loc>.</mixed-citation></ref>
<ref id="ref-6"><label>6.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>de Waard</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Lotz</surname>, <given-names>U.</given-names></string-name>, <string-name><surname>Milliams</surname>, <given-names>D. E.</given-names></string-name></person-group> (<year>1991</year>). <article-title>Predictive model for CO<sub>2</sub> corrosion engineering in wet natural gas pipelines</article-title>. <source>CORROSION</source><italic>,</italic> <volume>47</volume><issue>(12)</issue><italic>,</italic> <fpage>976</fpage>&#x2013;<lpage>985</lpage>. DOI <pub-id pub-id-type="doi">10.5006/1.3585212</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>7.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hedges</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2000</year>). <article-title>The corrosion inhibitor availability model</article-title>. <conf-name>NACE International Annual Conference &#x0026; Exposition</conf-name>, <conf-loc>Orlando, Florida, USA</conf-loc>.</mixed-citation></ref>
<ref id="ref-8"><label>8.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Moghissi</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Burwell</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Eckert</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Vera</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Sridhar</surname>, <given-names>N.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2004</year>). <article-title>Internal corrosion direct assessment for pipelines carrying wet gas-methodology</article-title>. <source>2004 International Pipeline Conference</source>. DOI <pub-id pub-id-type="doi">10.1115/IPC2004-0552</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>9.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Olsen</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2003</year>). <article-title>CO<sub>2</sub> corrosion prediction by use of the norsok M-506 model guidelines and limitations</article-title>. <conf-name>CORROSION 2003</conf-name>, <conf-loc>San Diego, California</conf-loc>.</mixed-citation></ref>
<ref id="ref-10"><label>10.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Iseley</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Matthews</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Liao</surname>, <given-names>W.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Hybrid machine learning for pullback force forecasting during horizontal directional drilling</article-title>. <source>Automation in Construction</source><italic>,</italic> <volume>129</volume><italic>,</italic> <fpage>103810</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.autcon.2021.103810</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>11.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Duong</surname>, <given-names>H. T.</given-names></string-name>, <string-name><surname>Phan</surname>, <given-names>H. C.</given-names></string-name>, <string-name><surname>Tran</surname>, <given-names>T. M.</given-names></string-name>, <string-name><surname>Dhar</surname>, <given-names>A. S.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Assessment of critical buckling load of functionally graded plates using artificial neural network modeling</article-title>. <source>Neural Computing &#x0026; Applications</source><italic>,</italic> <volume>33</volume><issue>(23)</issue><italic>,</italic> <fpage>1--13</fpage>. DOI <pub-id pub-id-type="doi">10.1007/s00521-021-06238-61-13</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>12.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Seghier</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Corriea</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Jafari-Asl</surname> <given-names>J.</given-names></string-name>, <string-name><surname>Malekjafarian</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Trung</surname>, <given-names>N. T.</given-names></string-name></person-group> (<year>2021</year>). <article-title>On the modeling of the annual corrosion rate in main cables of suspension bridges using combined soft computing model and a novel nature-inspired algorithm</article-title>. <source>Neural Computing &#x0026; Applications</source><italic>,</italic> <volume>33</volume><issue>(23)</issue><italic>,</italic> <fpage>15969</fpage>&#x2013;<lpage>15985</lpage>. DOI <pub-id pub-id-type="doi">10.1007/s00521-021-06199-w1-17</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>13.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jain</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>S&#x00E1;nchez</surname>, <given-names>A. N.</given-names></string-name>, <string-name><surname>Guan</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Ayello</surname>, <given-names>F.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2015</year>). <article-title>Probabilistic assessment of external corrosion rates in buried oil and gas pipelines</article-title>. <conf-name>NACE-International Corrosion Conference Series</conf-name>, <conf-loc>Dallas, Texas</conf-loc>.</mixed-citation></ref>
<ref id="ref-14"><label>14.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Abbas</surname>, <given-names>M. H.</given-names></string-name>, <string-name><surname>Norman</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Charles</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Neural network modelling of high pressure CO<sub>2</sub> corrosion in pipeline steels</article-title>. <source>Process Safety &#x0026; Environmental Protection: Transactions of the Institution of Chemical Engineers Part B</source><italic>,</italic> <volume>119</volume><italic>,</italic> <fpage>36</fpage>&#x2013;<lpage>45</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.psep.2018.07.006</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>15.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ossai</surname>, <given-names>C. I.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Corrosion defect modelling of aged pipelines with a feed-forward multi-layer neural network for leak and burst failure estimation</article-title>. <source>Engineering Failure Analysis</source><italic>,</italic> <volume>110</volume><italic>,</italic> <fpage>104397</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.engfailanal.2020.104397</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>16.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>Z.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Principal component analysis based dynamic fuzzy neural network for internal corrosion rate prediction of gas pipelines</article-title>. <source>Mathematical Problems in Engineering</source><italic>,</italic> <volume>2020</volume><issue>(1)</issue><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>9</lpage>. DOI <pub-id pub-id-type="doi">10.1155/2020/3681032</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>17.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kishawy</surname>, <given-names>H. A.</given-names></string-name>, <string-name><surname>Gabbar</surname>, <given-names>H. A.</given-names></string-name></person-group> (<year>2010</year>). <article-title>Review of pipeline integrity management practices</article-title>. <source>International Journal of Pressure Vessels and Piping</source><italic>,</italic> <volume>87</volume><issue>(7)</issue><italic>,</italic> <fpage>373</fpage>&#x2013;<lpage>380</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.ijpvp.2010.04.003</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>18.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vanaei</surname>, <given-names>H. R.</given-names></string-name>, <string-name><surname>Eslami</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Egbewande</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2017</year>). <article-title>A review on pipeline corrosion, in-line inspection (ILI), and corrosion growth rate models</article-title>. <source>International Journal of Pressure Vessels and Piping</source><italic>,</italic> <volume>149</volume><issue>(16)</issue><italic>,</italic> <fpage>43</fpage>&#x2013;<lpage>54</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.ijpvp.2016.11.007</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>19.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Zeng</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Guo</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Jiang</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Shi</surname>, <given-names>D.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2012</year>). <article-title>Electrochemical corrosion behavior of carbon steel under dynamic high pressure H<sub>2</sub>S/CO<sub>2</sub> environment</article-title>. <source>Corrosion Science</source><italic>,</italic> <volume>65</volume><italic>,</italic> <fpage>37</fpage>&#x2013;<lpage>47</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.corsci.2012.08.007</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>20.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhu</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Chien</surname>, <given-names>J.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2019</year>). <article-title>Image-text dual neural network with decision strategy for small-sample image classification</article-title>. <source>Neurocomputing</source><italic>,</italic> <volume>328</volume><italic>,</italic> <fpage>182</fpage>&#x2013;<lpage>188</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.neucom.2018.02.099</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>21.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Angshuman Paul</surname>, <given-names>Y. X. T.</given-names></string-name>, <string-name><surname>Thomas</surname>, <given-names>C. S.</given-names></string-name>, <string-name><surname>Ronald</surname>, <given-names>M. S.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Discriminative ensemble learning for few-shot chest x-ray diagnosis</article-title>. <source>Medical Image Analysis</source><italic>,</italic> <volume>68</volume><italic>,</italic> <fpage>101911</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.media.2020.101911</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>22.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Well log generation via ensemble long short-term memory (EnLSTM) network</article-title>. <source>Geophysical Research Letters</source><italic>,</italic> <volume>47</volume><issue>(23)</issue><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>9</lpage>. DOI <pub-id pub-id-type="doi">10.1029/2020GL087685</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>23.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Gu</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Qiao</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Ensemble meta-learning for few-shot soot density recognition</article-title>. <source>IEEE Transactions on Industrial Informatics</source><italic>,</italic> <volume>17</volume><issue>(3)</issue><italic>,</italic> <fpage>2261</fpage>&#x2013;<lpage>2270</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TII.9424</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>24.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Mahdavi-Shahri</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Karimian</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Javadi</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Houshmand</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Multi-Label Classification of Small Samples Using an Ensemble Technique</article-title>, <source>Iranian Conference on Electrical Engineering (ICEE)</source><italic>,</italic> Mashhad, Iran.</mixed-citation></ref>
<ref id="ref-25"><label>25.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Elmousalami</surname>, <given-names>H. H.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Comparison of artificial intelligence techniques for project conceptual cost prediction: A case study and comparative analysis</article-title>. <source>IEEE Transactions on Engineering Management</source><italic>,</italic> <volume>68</volume><issue>(1)</issue><italic>,</italic> <fpage>183</fpage>&#x2013;<lpage>196</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TEM.17</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>26.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lopez</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Fernandez</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Garcia</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2013</year>). <article-title>An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics</article-title>. <source>Information Sciences</source><italic>,</italic> <volume>250</volume><italic>,</italic> <fpage>113</fpage>&#x2013;<lpage>141</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.ins.2013.07.007</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>27.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Dvornik</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Schmid</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Mairal</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Diversity with cooperation: Ensemble methods for few-shot classification</article-title>. <source>IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences</source><italic>,</italic> pp. <fpage>3723</fpage>&#x2013;<lpage>3731</lpage>.</mixed-citation></ref>
<ref id="ref-28"><label>28.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Feng, Y. P., Pang, T. F., Li, M. Q., Guan, Y. Y.</surname></string-name></person-group> (<year>2020</year>). <article-title>Small sample face recognition based on ensemble deep learning</article-title>. 
<conf-name>2020 Chinese Control And Decision Conference (CCDC)</conf-name>, 
<publisher-name>IEEE</publisher-name>.</mixed-citation></ref>
<ref id="ref-29"><label>29.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Breiman</surname>, <given-names>L.</given-names></string-name></person-group> (<year>1996</year>). <article-title>Bagging predictors</article-title>. <source>Machine Learning</source><italic>,</italic> <volume>24</volume><issue>(2)</issue><italic>,</italic> <fpage>123</fpage>&#x2013;<lpage>140</lpage>. DOI <pub-id pub-id-type="doi">10.1007/BF00058655</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>30.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Freund</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Schapire</surname>, <given-names>R. E.</given-names></string-name></person-group> (<year>1997</year>). <article-title>A decision-theoretic generalization of on-line learning and an application to boosting</article-title>. <source>Journal of Computer and System Sciences</source><italic>,</italic> <volume>55</volume><issue>(1)</issue><italic>,</italic> <fpage>119</fpage>&#x2013;<lpage>139</lpage>. DOI <pub-id pub-id-type="doi">10.1006/jcss.1997.1504</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>31.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Iwabuchi</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>de H. e Ayres de Moura</surname>, <given-names>A. A.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>Ensemble meteorological cloud classification meets internet of dependable and controllable things</article-title>. <source>IEEE Internet of Things Journal</source><italic>,</italic> <volume>8</volume><issue>(5)</issue><italic>,</italic> <fpage>3323</fpage>&#x2013;<lpage>3330</lpage>. DOI <pub-id pub-id-type="doi">10.1109/jiot.2020.30432891</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>32.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Feng</surname>, <given-names>D. C.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>Z. T.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>X. D.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Chang</surname>, <given-names>J. Q.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2020</year>). <article-title>Machine learning-based compressive strength prediction for concrete: An adaptive boosting approach</article-title>. <source>Construction &#x0026; Building Materials</source><italic>,</italic> <volume>230</volume><italic>,</italic> <fpage>117000</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.conbuildmat.2019.117000</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>33.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chan</surname>, <given-names>J. C. W.</given-names></string-name></person-group> (<year>2008</year>). <article-title>Evaluation of random forest and adaboost tree-based ensemble classification and spectral band selection for ecotope mapping using airborne hyperspectral imagery</article-title>. <source>Remote Sensing of Environment</source><italic>,</italic> <volume>112</volume><issue>(6)</issue><italic>,</italic> <fpage>2999</fpage>&#x2013;<lpage>3011</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.rse.2008.02.011</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>34.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Azhari</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Abarda</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Alaoui</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ettaki</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Zerouaoui</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Detection of pulsar candidates using bagging method</article-title>. <source>Procedia Computer Science</source><italic>,</italic> <volume>170</volume><italic>,</italic> <fpage>1096</fpage>&#x2013;<lpage>1101</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.procs.2020.03.062</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>35.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Abbes</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Sellami</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Marc-Zwecker</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Zanni-Merk</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Fuzzy decision ontology for melanoma diagnosis using KNN classifier</article-title>. <source>Multimedia Tools and Applications</source><italic>,</italic> <volume>80</volume><issue>(17)</issue><italic>,</italic> <fpage>25517</fpage>&#x2013;<lpage>25538</lpage>. DOI <pub-id pub-id-type="doi">10.1007/s11042-021-10858-4</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>36.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Devi</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Thirumurugan</surname>, <given-names>N.</given-names></string-name></person-group> P. (<year>2021</year>). <article-title>Cervical cancer classification from pap smear images using modified fuzzy C means, PCA, and KNN</article-title>. <source>IETE Journal of Research</source><italic>,</italic> <volume>67</volume><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>8</lpage>. DOI <pub-id pub-id-type="doi">10.1080/03772063.2021.1997353</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>37.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liao</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>The effect of acetic acid on the localized corrosion of 3Cr steel in the CO<sub>2</sub>-saturated oilfield formation water</article-title>. <source>International Journal of Electrochemical Science</source><italic>,</italic> <volume>15</volume><italic>,</italic> <fpage>8622</fpage>&#x2013;<lpage>8637</lpage>. DOI <pub-id pub-id-type="doi">10.20964/2020.09.24</pub-id>.</mixed-citation></ref>
<ref id="ref-38"><label>38.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liao</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Qin</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>He</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Study on corrosion mechanism and the risk of the shale gas gathering pipelines</article-title>. <source>Engineering Failure Analysis</source><italic>,</italic> <volume>128</volume><issue>(5)</issue><italic>,</italic> <fpage>105622</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.engfailanal.2021.105622</pub-id>.</mixed-citation></ref>
<ref id="ref-39"><label>39.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Peng</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Zeng</surname>, <given-names>Z.</given-names></string-name></person-group> (<year>2015</year>). <article-title>An experimental study on the internal corrosion of a subsea multiphase pipeline</article-title>. <source>Petroleum</source><italic>,</italic> <volume>1</volume><issue>(1)</issue><italic>,</italic> <fpage>75</fpage>&#x2013;<lpage>81</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.petlm.2015.04.003</pub-id>.</mixed-citation></ref>
<ref id="ref-40"><label>40.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bendiksen</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Maines</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Moe</surname>, <given-names>R.</given-names></string-name></person-group> (<year>1991</year>). <article-title>The dynamic two-fluid model OLGA: Theory and application</article-title>. <source>SPE Production Engineering</source><italic>,</italic> <volume>6</volume><issue>(2)</issue><italic>,</italic> <fpage>171</fpage>&#x2013;<lpage>180</lpage>. DOI <pub-id pub-id-type="doi">10.2118/19451-PA</pub-id>.</mixed-citation></ref>
<ref id="ref-41"><label>41.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hansen</surname>, <given-names>L. K.</given-names></string-name>, <string-name><surname>Salamon</surname>, <given-names>P.</given-names></string-name></person-group> (<year>1990</year>). <article-title>Neural network ensembles. pattern analysis and machine intelligence</article-title>. <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source><italic>,</italic> <volume>12</volume><issue>(10)</issue><italic>,</italic> <fpage>993</fpage>&#x2013;<lpage>1001</lpage>. DOI <pub-id pub-id-type="doi">10.1109/34.58871</pub-id>.</mixed-citation></ref>
<ref id="ref-42"><label>42.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Nayak</surname>, <given-names>D. R.</given-names></string-name>, <string-name><surname>Dash</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Majhi</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2016</year>). <article-title>Brain MR image classification using two-dimensional discrete wavelet transform and adaboost with random forests</article-title>. <source>Neurocomputing</source><italic>,</italic> <volume>177</volume><italic>,</italic> <fpage>188</fpage>&#x2013;<lpage>197</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.neucom.2015.11.034</pub-id>.</mixed-citation></ref>
<ref id="ref-43"><label>43.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Aflah</surname>, <given-names>W. M.</given-names></string-name>, <string-name><surname>Zubir</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Aziz</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Jaafar</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2019</year>). <chapter-title>Evaluation of machine learning algorithms in predicting CO<sub>2</sub> internal corrosion in oil and gas pipelines</chapter-title>. In: <source>Computational and statistical methods in intelligent systems</source>, pp. 236&#x2013;254. Berlin, German: Springer.</mixed-citation></ref>
<ref id="ref-44"><label>44.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Hao</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Jiang</surname>, <given-names>H.</given-names></string-name></person-group> (<year>2011</year>). <article-title>A comparative assessment of ensemble learning for credit scoring</article-title>. <source>Expert Systems with Applications</source><italic>,</italic> <volume>38</volume><issue>(1)</issue><italic>,</italic> <fpage>223</fpage>&#x2013;<lpage>230</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.eswa.2010.06.048</pub-id>.</mixed-citation></ref>
<ref id="ref-45"><label>45.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Breiman</surname>, <given-names>L.</given-names></string-name></person-group> (<year>1998</year>). <article-title>Arcing classifier (with discussion and a rejoinder by the author)</article-title>. <source>The Annals of Statistics</source><italic>,</italic> <volume>26</volume><issue>(3)</issue><italic>,</italic> <fpage>801</fpage>&#x2013;<lpage>849</lpage>. DOI <pub-id pub-id-type="doi">10.1214/aos/1024691079</pub-id>.</mixed-citation></ref>
<ref id="ref-46"><label>46.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hechenbichler</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Schliep</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Chang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Lin</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2004</year>). <article-title>Weighted k-nearest-neighbor techniques and ordinal classification</article-title>. <source>ACM Transactions on Intelligent Systems and Technology</source><italic>,</italic> <volume>2</volume><issue>(3)</issue><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>39</lpage>. DOI <pub-id pub-id-type="doi">10.5282/ubm/epub.1769</pub-id>.</mixed-citation></ref>
<ref id="ref-47"><label>47.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Peterson</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2009</year>). <article-title>K-nearest neighbor</article-title>. <source>Scholarpedia</source><italic>,</italic> <volume>4</volume><italic>,</italic> <fpage>1883</fpage>. DOI <pub-id pub-id-type="doi">10.4249/scholarpedia.1883</pub-id>.</mixed-citation></ref>
<ref id="ref-48"><label>48.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Aha</surname>, <given-names>D. W.</given-names></string-name>, <string-name><surname>Kibler</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Albert</surname>, <given-names>M. K.</given-names></string-name></person-group> (<year>1991</year>). <article-title>Instance-based learning algorithms</article-title>. <source>Machine Learning</source><italic>,</italic> <volume>6</volume><issue>(1)</issue><italic>,</italic> <fpage>37</fpage>&#x2013;<lpage>66</lpage>. DOI <pub-id pub-id-type="doi">10.1007/BF00153759</pub-id>.</mixed-citation></ref>
<ref id="ref-49"><label>49.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Lin</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2011</year>). <article-title>LIBSVM: A library for support vector machines</article-title>. <source>ACM Transactions on Intelligent Systems and Technology</source><italic>,</italic> <volume>2</volume><issue>(3)</issue><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>39</lpage>. DOI <pub-id pub-id-type="doi">10.1145/1961189.1961199</pub-id>.</mixed-citation></ref>
<ref id="ref-50"><label>50.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Breiman</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Friedman</surname>, <given-names>J. H.</given-names></string-name>, <string-name><surname>Olshen</surname>, <given-names>R. A.</given-names></string-name>, <string-name><surname>Stone</surname>, <given-names>C. J.</given-names></string-name></person-group> (<year>1984</year>). <article-title>Classification and regression trees</article-title>. <source>Journal of the American Statistical Association</source><italic>,</italic> <volume>81</volume><italic>,</italic> <fpage>393</fpage>. DOI <pub-id pub-id-type="doi">10.2307/2530946</pub-id>.</mixed-citation></ref>
<ref id="ref-51"><label>51.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Deng</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Runger</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2013</year>). <article-title>Gene selection with guided regularized random forest</article-title>. <source>Pattern Recognition</source><italic>,</italic> <volume>46</volume><issue>(12)</issue><italic>,</italic> <fpage>3483</fpage>&#x2013;<lpage>3489</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.patcog.2013.05.018</pub-id>.</mixed-citation></ref>
<ref id="ref-52"><label>52.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Geurts</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Science</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Ernst</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Science</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Wehenkel</surname>, <given-names>L.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2006</year>). <article-title>Extremely randomized trees</article-title>. <source>Machine Learning</source><italic>,</italic> <volume>36</volume><issue>(1)</issue><italic>,</italic> <fpage>3</fpage>&#x2013;<lpage>42</lpage>. DOI <pub-id pub-id-type="doi">10.1007/s10994-006-6226-1</pub-id>.</mixed-citation></ref>
<ref id="ref-53"><label>53.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Friedman</surname>, <given-names>J. H.</given-names></string-name></person-group> (<year>2001</year>). <article-title>Greedy function approximation: A gradient boosting machine</article-title>. <source>The Annals of Statistics</source><italic>,</italic> <volume>29</volume><issue>(5)</issue><italic>,</italic> <fpage>1189</fpage>&#x2013;<lpage>1232</lpage>. DOI <pub-id pub-id-type="doi">10.1214/aos/1013203451</pub-id>.</mixed-citation></ref>
<ref id="ref-54"><label>54.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Ke</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Meng</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Finley</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>W.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2017</year>). <article-title>LightGBM: A highly efficient gradient boosting decision tree</article-title>. In: <source>Advances in neural information processing systems 30 (NIPS 2017))</source>.</mixed-citation></ref>
<ref id="ref-55"><label>55.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kleiner</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Graedel</surname>, <given-names>T. E.</given-names></string-name></person-group> (<year>1980</year>). <article-title>Exploratory data analysis in the geophysical sciences</article-title>. <source>Reviews of Geophysics</source><italic>,</italic> <volume>18</volume><issue>(3)</issue><italic>,</italic> <fpage>699</fpage>&#x2013;<lpage>717</lpage>. DOI <pub-id pub-id-type="doi">10.1029/RG018i003p00699</pub-id>.</mixed-citation></ref>
<ref id="ref-56"><label>56.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Seyedzadeh</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Rahimian</surname>, <given-names>F. P.</given-names></string-name>, <string-name><surname>Oliver</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Glesk</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Data driven model improved by multi-objective optimisation for prediction of building energy loads</article-title>. <source>Automation in Construction</source><italic>,</italic> <volume>116</volume> <fpage>103188</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.autcon.2020.103188</pub-id>.</mixed-citation></ref>
<ref id="ref-57"><label>57.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Phan</surname>, <given-names>H. C.</given-names></string-name>, <string-name><surname>Duong</surname>, <given-names>H. T.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Predicting burst pressure of defected pipeline with principal component analysis and adaptive neuro fuzzy inference system</article-title>. <source>International Journal of Pressure Vessels and Piping</source><italic>,</italic> <volume>189</volume>, <fpage>104274</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.ijpvp.2020.104274</pub-id>.</mixed-citation></ref>
<ref id="ref-58"><label>58.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Kohavi</surname>, <given-names>R.</given-names></string-name></person-group> (<year>1995</year>). <article-title>A study of cross-validation and bootstrap for accuracy estimation and model selection</article-title>, <source>International Joint Conference on Artificial Intelligence</source>.</mixed-citation></ref>
<ref id="ref-59"><label>59.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Friedman</surname>, <given-names>M.</given-names></string-name></person-group> (<year>1937</year>). <article-title>The use of ranks to avoid the assumption of normality implicit in the analysis of variance</article-title>. <source>Journal of the American Statistical Association</source><italic>,</italic> <volume>32</volume><issue>(200)</issue><italic>,</italic> <fpage>675</fpage>&#x2013;<lpage>701</lpage>. DOI <pub-id pub-id-type="doi">10.1080/01621459.1937.10503522</pub-id>.</mixed-citation></ref>
<ref id="ref-60"><label>60.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>Z. D.</given-names></string-name>, <string-name><surname>Iseley</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Matthews</surname>, <given-names>J. C.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Novel data-driven framework for predicting residual strength of corroded pipelines</article-title>. <source>Journal of Pipeline Systems Engineering &#x0026; Practice</source><italic>,</italic> <volume>12</volume><issue>(4)</issue><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>10</lpage>. DOI <pub-id pub-id-type="doi">10.1061/(ASCE)PS.1949-1204.0000587</pub-id>.</mixed-citation></ref>
<ref id="ref-61"><label>61.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Menahem</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Rokach</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Elovici</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2009</year>). <article-title>Troika&#x2013;An improved stacking schema for classification tasks</article-title>. <source>Information Sciences</source><italic>,</italic> <volume>179</volume><issue>(24)</issue><italic>,</italic> <fpage>4097</fpage>&#x2013;<lpage>4122</lpage>. DOI <pub-id pub-id-type="doi">10.1016/j.ins.2009.08.025</pub-id>.</mixed-citation></ref>
<ref id="ref-62"><label>62.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Chen</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Guestrin</surname>, <given-names>C.</given-names></string-name></person-group> (<year>2016</year>). <article-title>XGBoost: A scalable tree boosting system</article-title>, <source>22nd ACM SIGKDD International Conference</source>.</mixed-citation></ref>
<ref id="ref-63"><label>63.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chan</surname>, <given-names>J. C. W.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>DeFries</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2001</year>). <article-title>Enhanced algorithm performance for land cover classification from remotely sensed data using bagging and boosting</article-title>. <source>IEEE Transactions on Geoscience and Remote Sensing</source><italic>,</italic> <volume>39</volume><issue>(3)</issue><italic>,</italic> <fpage>693</fpage>&#x2013;<lpage>693</lpage>. DOI <pub-id pub-id-type="doi">10.1109/36.911126</pub-id>.</mixed-citation></ref>
<ref id="ref-64"><label>64.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>DeFries</surname>, <given-names>R. S.</given-names></string-name>, <string-name><surname>Chan</surname>, <given-names>J. C. W.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Multiple criteria for evaluating machine learning algorithms for land cover classification from satellite data</article-title>. <source>Remote Sensing of Environment</source><italic>,</italic> <volume>74</volume><issue>(3)</issue><italic>,</italic> <fpage>503</fpage>&#x2013;<lpage>515</lpage>. DOI <pub-id pub-id-type="doi">10.1016/S0034-4257(00)00142-5</pub-id>.</mixed-citation></ref>
<ref id="ref-65"><label>65.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Briem</surname>, <given-names>G. J.</given-names></string-name>, <string-name><surname>Benediktsson</surname>, <given-names>J. A.</given-names></string-name>, <string-name><surname>Sveinsson</surname>, <given-names>J. R.</given-names></string-name></person-group> (<year>2002</year>). <article-title>Multiple classifiers applied to multisource remote sensing data</article-title>. <source>IEEE Transactions on Geoscience and Remote Sensing</source><italic>,</italic> <volume>40</volume><issue>(10)</issue><italic>,</italic> <fpage>2291</fpage>&#x2013;<lpage>2299</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TGRS.2002.802476</pub-id>.</mixed-citation></ref>
<ref id="ref-66"><label>66.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Orriols-Puig</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Casillas</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Bernad&#x00F3;-Mansilla</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2009</year>). <article-title>Fuzzy-UCS: A michigan-style learning fuzzy-classifier system for supervised learning</article-title>. <source>IEEE Transactions on Evolutionary Computation</source><italic>,</italic> <volume>13</volume><issue>(2)</issue><italic>,</italic> <fpage>260</fpage>&#x2013;<lpage>283</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TEVC.2008.925144</pub-id>.</mixed-citation></ref>
<ref id="ref-67"><label>67.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Weiss</surname>, <given-names>G. M.</given-names></string-name>, <string-name><surname>Provost</surname>, <given-names>F.</given-names></string-name></person-group> (<year>2003</year>). <article-title>Learning when training data are costly: The effect of class distribution on tree induction</article-title>. <source>Journal of Artificial Intelligence Research</source><italic>,</italic> <volume>19</volume><italic>,</italic> <fpage>315</fpage>&#x2013;<lpage>354</lpage>. DOI <pub-id pub-id-type="doi">10.1613/jair.1199</pub-id>.</mixed-citation></ref>
<ref id="ref-68"><label>68.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jo</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Japkowicz</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2004</year>). <article-title>Class imbalances versus small disjuncts</article-title>. <source>ACM SIGKDD Explorations Newsletter</source><italic>,</italic> <volume>6</volume><issue>(1)</issue><italic>,</italic> <fpage>40</fpage>&#x2013;<lpage>49</lpage>. DOI <pub-id pub-id-type="doi">10.1145/1007730.1007737</pub-id>.</mixed-citation></ref>
<ref id="ref-69"><label>69.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lameiro</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Schreier</surname>, <given-names>P. J.</given-names></string-name></person-group> (<year>2017</year>). <article-title>A sparse CCA algorithm with application to model-order selection for small sample support</article-title>. <conf-name>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</conf-name>, pp. <fpage>4721</fpage>&#x2013;<lpage>4725</lpage>. <conf-loc>New Orleans, LA, USA</conf-loc>. </mixed-citation></ref>
<ref id="ref-70"><label>70.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Windeat</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Ardeshir</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2004</year>). <article-title>Decision tree simplification for classifier ensembles</article-title>. <source>International Journal of Pattern Recognition and Artificial Intelligence</source><italic>,</italic> <volume>18</volume><issue>(5)</issue><italic>,</italic> <fpage>749</fpage>&#x2013;<lpage>776</lpage>. DOI <pub-id pub-id-type="doi">10.1142/S021800140400340X</pub-id>.</mixed-citation></ref>
<ref id="ref-71"><label>71.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>D&#x017E;eroski</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>&#x017D;enko</surname>, <given-names>B.</given-names></string-name></person-group> (<year>2004</year>). <article-title>Is combining classifiers with stacking better than selecting the best one?</article-title> <source>Machine Learning</source><italic>,</italic> <volume>54</volume><issue>(3)</issue><italic>,</italic> <fpage>255</fpage>&#x2013;<lpage>273</lpage>. DOI <pub-id pub-id-type="doi">10.1023/B:MACH.0000015881.36452.6e</pub-id>.</mixed-citation></ref>
</ref-list>
</back>
</article>