<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">JIMH</journal-id>
<journal-id journal-id-type="nlm-ta">JIMH</journal-id>
<journal-id journal-id-type="publisher-id">JIMH</journal-id>
<journal-title-group>
<journal-title>Journal of Intelligent Medicine and Healthcare</journal-title>
</journal-title-group>
<issn pub-type="epub">2837-634X</issn>
<issn pub-type="ppub">2837-6331</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">46995</article-id>
<article-id pub-id-type="doi">10.32604/jimh.2023.046995</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Work Review on Clinical Laboratory Data Utilizing Machine Learning Use-Case Methodology</article-title>
<alt-title alt-title-type="left-running-head">A Work Review on Clinical Laboratory Data Utilizing Machine Learning Use-Case Methodology</alt-title>
<alt-title alt-title-type="right-running-head">A Work Review on Clinical Laboratory Data Utilizing Machine Learning Use-Case Methodology</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Ramasamy</surname><given-names>Uma</given-names>
</name><email>seen.uma25@gmail.com</email></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Santhoshkumar</surname><given-names>Sundar</given-names>
</name></contrib>
<aff><institution>Department of Computer Science, Alagappa University, Karaikudi</institution>, <addr-line>Tamil Nadu</addr-line>, <country>India</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Uma Ramasamy. Email: <email>seen.uma25@gmail.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2024</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>10</day><month>01</month><year>2024</year></pub-date>
<volume>2</volume>
<issue>0</issue>
<fpage>1</fpage>
<lpage>14</lpage>
<history>
<date date-type="received">
<day>21</day>
<month>10</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>29</day>
<month>11</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2024 Ramasamy and Santhoshkumar</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Ramasamy and Santhoshkumar</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_JIMH_46995.pdf"></self-uri>
<abstract>
<p>More than 140 autoimmune diseases have distinct autoantibodies and symptoms, which makes it challenging to construct an appropriate Machine Learning (ML) model for autoimmune disease. Arthritis-related autoimmunity requires special attention. Although many conventional biomarkers for arthritis have been established, more biomarkers of arthritis autoimmune diseases remain to be identified. This review focuses on research conducted using data obtained from clinical laboratory testing of real-time arthritis patients. The collected data is labelled the Arthritis Profile Data (APD) dataset. The APD dataset is retrospective data with many missing values. We undertook a comprehensive study of the APD dataset comprising four key steps. First, we identified suitable imputation techniques for the APD dataset. Second, we conducted a comparative analysis with different benchmark disease datasets. Third, we determined the most effective ML model for the APD dataset. Finally, we identified the hidden biomarkers in the APD dataset. We applied various imputation techniques to handle the missing data in the APD dataset, and the best imputation techniques were determined using the degree of proximity (DoP) and degree of residual (DoR) procedures. The random value imputer and mode imputer were identified as the suitable imputation techniques. Different benchmark disease datasets were compared using different hold-out (HO) methods and cross-validation (CV) folds, which highlighted that dataset properties significantly impact the performance of ML models. Random Forest (RndF) and XGBoost (XGB) are the best performing ML algorithms for most diseases, with accuracy consistently above 80%. The appropriate ML model for the APD dataset is XGB (Extreme Gradient Boosting). Moreover, significant features of the APD dataset were identified using XGB feature importance. The substantial and hidden biomarkers identified were Erythrocyte Sedimentation Rate (ESR), Antistreptolysin O (ASO), C-Reactive Protein (CRP), Rheumatoid Factor (RF), Lymphocytes (L), Absolute Eosinophil count (Abs), Uric_Acid, Red Blood Cell count (RBC), and Blood for Total Count (TC).</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Autoimmune diseases</kwd>
<kwd>biomarkers</kwd>
<kwd>arthritis data</kwd>
<kwd>imputation techniques</kwd>
<kwd>machine learning algorithms</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Department of Science and Technology, New Delhi</funding-source>
<award-id>24-51/2014-U</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Rheumatology is the branch of medicine that deals with diagnosing and treating rheumatic diseases. Autoimmune diseases, autoinflammatory diseases, crystalline arthritis, metabolic bone diseases, etc., come under the rheumatic disease category [<xref ref-type="bibr" rid="ref-1">1</xref>]. No single blood test can quickly confirm a rheumatic disease diagnosis [<xref ref-type="bibr" rid="ref-2">2</xref>]. ANA (Antinuclear Antibody) is the standard test to identify rheumatic disease. However, ANA can also be positive in many conditions in an average person. Antibodies that the immune system produces against self-antigens, causing pathology, are known as autoantibodies [<xref ref-type="bibr" rid="ref-3">3</xref>]. Autoantibodies cause autoimmune disease [<xref ref-type="bibr" rid="ref-4">4</xref>]. Deaths from autoimmune diseases are grouped into 13 subcategories. Diabetes, multiple sclerosis, pernicious anaemia, arthritis, and lupus are the five autoimmune diseases that cause the most deaths, listed from most to least [<xref ref-type="bibr" rid="ref-5">5</xref>]. Both males and females suffer from many arthritis diseases, but autoimmune arthritis affects females more than males [<xref ref-type="bibr" rid="ref-6">6</xref>]. Though much research has been done on treating autoimmune diseases, it remains necessary to find biomarkers that play a significant role in identifying them. Exploring the significance of hidden biomarkers in autoimmune disease prognosis is also indispensable. Only a few research studies have been initiated on predicting autoimmune disease using patient data [<xref ref-type="bibr" rid="ref-7">7</xref>]. Discovering hidden biomarkers from autoimmune patient data records is therefore a research problem that must be addressed.</p>
<p>ML is a subset of artificial intelligence in which the machine learns from existing data and makes decisions to solve a problem quickly [<xref ref-type="bibr" rid="ref-8">8</xref>]. Autoimmune diseases are predicted using ML algorithms [<xref ref-type="bibr" rid="ref-9">9</xref>]. The arthritis profile patient data was collected from Sri Easwari Computerized Lab, Karaikudi, Tamil Nadu, to predict autoimmune arthritis disease and identify significant and hidden biomarkers using ML algorithms.</p>
<p>Over one year, from February 2021 to February 2022, we collected arthritis information from Sri Easwari Computerized Lab, Karaikudi, Tamil Nadu, India. The clinical laboratory provided patient details consisting of demographic data and the Arthritis Profile I, Arthritis Profile II, and Arthritis Profile III investigations data. The attributes of each arthritis data category are displayed in <xref ref-type="table" rid="table-1">Table 1</xref>. We named the dataset &#x2018;Arthritis Profile Data&#x2019; (APD). The APD has 24 attributes and 52 data points. Except for the &#x2018;gender&#x2019; feature, every feature in our dataset is numeric, with discrete or continuous values. &#x2018;Gender&#x2019; and &#x2018;RF&#x2019; are the only attributes with no missing data. The empty entries in the APD dataset mean that each data point is missing at least one feature value [<xref ref-type="bibr" rid="ref-10">10</xref>].</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Arthritis data features category</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>S. No.</th>
<th>Category arthritis data</th>
<th>Features</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.</td>
<td>Demographic data</td>
<td>Age and gender</td>
</tr>
<tr>
<td>2.</td>
<td>Arthritis profile I investigation</td>
<td>TC, P, L, E, &#x002A;ESR, Hb, RBC, Abs, PC, PCV, MCV, MCH, and MCHC</td>
</tr>
<tr>
<td>3.</td>
<td>Arthritis profile II investigation</td>
<td>ASO, RF, and CRP</td>
</tr>
<tr>
<td>4.</td>
<td>Arthritis profile III investigation</td>
<td>RBS, Blood urea, Creatinine (serum), Calcium (serum), and Uric acid (serum)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn><p>Note: P-Polymorphs, E-Eosinophils, Hb-Haemoglobin, PC-Platelet Count, PCV-Packed Cell Volume, MCV-Mean Corpuscular Volume, MCH-Mean Corpuscular Hemoglobin, MCHC-Mean Corpuscular Hb Concentration, RBS&#x2013;Random Blood Sugar. &#x002A;ESR for half an hour and one hour.</p></fn></table-wrap-foot>
</table-wrap>
<p>Our work significantly contributes to the domain of clinical laboratory data and machine learning methodologies in the following key aspects:
<list list-type="bullet">
<list-item>
<p>Identifying suitable imputation techniques tailored to the APD dataset enhances data completeness.</p></list-item>
<list-item>
<p>Selecting an ML model well aligned with the characteristics of the APD dataset.</p></list-item>
<list-item>
<p>Other hidden biomarkers have been recognized in addition to the well-known biomarkers.</p></list-item>
<list-item>
<p>It has been identified that the inherent characteristics of the dataset significantly impact the accuracy score.</p></list-item>
<list-item>
<p>Identifying the substantial impact of the HO method and CV fold size on model accuracy, which provides key methodological insights.</p></list-item>
</list></p>
<p>The remaining review comprises the following sections: The second section describes the generic methodology for ML use cases. The third section discusses acceptable ML imputation approaches for arthritis profile data, while the fourth section compares arthritis profile data with other benchmark datasets and discusses viable ML algorithms for it. <xref ref-type="sec" rid="s5">Section 5</xref> explains the conclusion of our study and recommends future work.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>General Approach for Use Case Using ML Model</title>
<p>The use case in our research work is to assess whether a patient is affected by autoimmune arthritis disease. The flow diagram in <xref ref-type="fig" rid="fig-1">Fig. 1</xref> outlines the ML workflow for any use case. In our scenario, the initial step is the data collection process [<xref ref-type="bibr" rid="ref-11">11</xref>]. As mentioned earlier, data were collected from a computerized lab. The next step is to check whether the collected data is in the proper format to create an ML model. Since the collected information is not in the appropriate form, the next step is data preprocessing.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Flow diagram of use case using machine learning model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JIMH_46995-fig-1.tif"/>
</fig>
<p>In the data preprocessing step, feature engineering and feature selection are performed [<xref ref-type="bibr" rid="ref-12">12</xref>]. Feature engineering examines each feature and converts it into an acceptable input format for the ML algorithm. Handling categorical variables, handling missing values, handling imbalanced datasets, etc., are performed in feature engineering. In feature selection, certain features are removed by checking the correlation or covariance of the predictor and target variables. If an independent variable is not correlated with the target variable, it is a candidate for removal.</p>
<p>Since the APD data has missing values, they are handled using imputation techniques in our use case. Moreover, feature selection is performed to mitigate the curse of dimensionality. The imputation techniques applied to the APD dataset are described in detail in the forthcoming section. According to <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, the step that follows data preprocessing is modeling.</p>
<p>Model creation consists of feature engineering, model training, and model evaluation. Normalization is applied to the features. Once the feature values are in the proper format, different ML as well as deep learning techniques can be applied [<xref ref-type="bibr" rid="ref-13">13</xref>]. The suitable ML algorithm for our use case is finally identified through model evaluation, using measures such as accuracy and the confusion matrix, together with CV and hyperparameter optimization [<xref ref-type="bibr" rid="ref-14">14</xref>]. The most suitable ML model for the use case is the ML algorithm that secures the highest score. In our scenario, different HO and CV methods were used to identify a suitable ML algorithm for the APD dataset. This study is described in detail in the forthcoming section. The trained model is tested using the test data, and its prediction accuracy is checked. If its accuracy is good, deployment can be done using web services. If not, there may be an issue in the collected data or in data preprocessing, so the process must be repeated. The coding was done in Python, which is well known for features such as robust portability, good interpretability, and strong versatility. The experiments were run on a computer with a CPU operating at 2.40 and 2.42 GHz, 12 GB of memory, and a 64-bit operating system.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Assessment of Relevant Imputation Techniques for the APD Dataset</title>
<p><xref ref-type="fig" rid="fig-2">Fig. 2</xref> shows the percentage (%) of missing feature values in the APD dataset. No values were missing in the &#x2018;RF&#x2019; and &#x2018;Gender&#x2019; features. The missing values in each feature were handled using ML imputation techniques. Many researchers have suggested different imputation techniques for their proposed work [<xref ref-type="bibr" rid="ref-15">15</xref>&#x2013;<xref ref-type="bibr" rid="ref-19">19</xref>]. Various ML algorithms reveal different model performance and classification accuracy. In general, the ML model performance obtained on an incomplete dataset may vary depending on the imputation technique applied to it. Therefore, selecting a pertinent imputation technique for the incomplete record set is necessary.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Percentage of missing values in the APD dataset features</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JIMH_46995-fig-2.tif"/>
</fig>
<p>ML models are commonly employed by researchers to address missing and imputed (imp) data. The findings of these studies can vary depending on several factors, including the record set domain, the missingness category (Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing not at Random (MNAR)), the pattern of missingness, the percentage of missing values in the data sample, and the chosen imputation techniques. Further investigation into imputation techniques is consistently required, as their efficacy is contingent upon multiple factors. Both ML algorithms and traditional imputation techniques can be implemented, but discerning which method outperforms the other poses a challenge. Additionally, the metrics used to assess imp data vary with each researcher&#x2019;s investigation.</p>
<p>Seven imp APD datasets were generated, one for each imputation technique. The imputation techniques applied were mean, median, mode (using the first index of the mode value), random value, K-nearest neighbors (KNN), Multiple Imputation by Chained Equation (MICE), and Random Forest (RndF) imputation. Statistical parameters such as arithmetic average (mean), midpoint (median), and standard deviation (SD) were computed for the incomplete dataset and for all seven complete imp datasets. By analyzing these statistical parameters, the distribution of each of the seven imp datasets was compared with that of the original dataset. Furthermore, the DoP was assessed for each of the seven imp datasets to identify the values that most closely resemble the values in the incomplete APD dataset. A higher DoP indicates a greater similarity between the imp and original values.</p>
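<p>The simple imputers named above, and the mean, median, and SD comparison, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the helper names <monospace>impute</monospace> and <monospace>summary</monospace> and the sample column are hypothetical, only the mean, median, mode, and random value strategies are covered, and the DoP ranking itself is not reproduced.</p>
<preformat>
```python
import random
import statistics


def impute(values, strategy, seed=0):
    """Fill None entries in one numeric feature column (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        # multimode lists modes in first-seen order; take the first one.
        fill = statistics.multimode(observed)[0]
    elif strategy == "random":
        # Random value imputation: draw each replacement from the observed values.
        rng = random.Random(seed)
        return [rng.choice(observed) if v is None else v for v in values]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def summary(values):
    """Mean, median, and sample SD of the non-missing entries."""
    observed = [v for v in values if v is not None]
    return (
        statistics.mean(observed),
        statistics.median(observed),
        statistics.stdev(observed),
    )


# Hypothetical feature column (not APD data); None marks a missing value.
column = [12.0, None, 15.0, 12.0, None, 30.0]
print("incomplete", [round(s, 2) for s in summary(column)])
for strategy in ("mean", "median", "mode", "random"):
    filled = impute(column, strategy)
    print(strategy, [round(s, 2) for s in summary(filled)])
```
</preformat>
<p>Comparing each printed triple with the one for the incomplete column shows how far an imputer shifts the distribution of a feature, which is the kind of comparison the proximity assessment relies on.</p>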
<p>Selecting the most effective imputation techniques depends on the suggested proximity and residual evaluation measures. The overall process for assessing proximity involves determining the extent to which the imp values in the dataset deviate from the original values. This is done by considering statistical properties such as arithmetic average (mean), midpoint (median) and modal value. Similarly, the general procedure for evaluating residual focuses on quantifying the discrepancy between the original and imp values.</p>
<p>Appropriate imputation strategies are determined based on the DoP, following the procedure depicted in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. The mean imp value, MICE imputer, and RndF imputer are the most effective imputation techniques with respect to the mean. Similarly, the median imp value, random value imputer, and RndF imputer are the top three imputation procedures with respect to the median. Furthermore, the preferred imputation approaches with respect to the SD are the random value imputer, KNN imputer, and mode imp value.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Generic procedure to analyze the degree of proximity in the dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JIMH_46995-fig-3.tif"/>
</fig>
<p>To evaluate the imp datasets, the Imp Mean Absolute Error (MAEim), Imp Mean Square Error (MSEim), Imp Root Mean Square Error (RMSEim), and Imp Coefficient of determination or R-Squared (R<sup>2</sup>im) are computed for each feature between each imp dataset and the original incomplete APD dataset (with missing values substituted by zero).</p>
<p>To assess the quality of each imp dataset, we utilize three metrics: MAEim, RMSEim, and R<sup>2</sup>im. These metrics measure the level of residual present. Our primary objective is to determine the error between the original and imp values, as this provides insight into the performance of the imp dataset. Once we have calculated these evaluation criteria, the next step is to rank each imp dataset based on its attributes. We then apply the residual technique to each performance metric, where a higher DoR corresponds to a smaller residual in the imp dataset.</p>
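<p>The residual metrics for a single feature can be sketched as follows. This is an illustrative sketch rather than the implementation used in the study: the function name <monospace>residual_metrics</monospace> and the sample values are hypothetical, and treating the zero-filled original column as the reference in the R-squared term is an assumption.</p>
<preformat>
```python
import math


def residual_metrics(original, imputed):
    """MAE, MSE, RMSE, and R-squared between a zero-filled original column
    and its imputed counterpart (hypothetical helper)."""
    # Missing entries (None) in the original column are substituted by zero.
    baseline = [0.0 if v is None else v for v in original]
    n = len(baseline)
    errors = [b - i for b, i in zip(baseline, imputed)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_b = sum(baseline) / n
    ss_tot = sum((b - mean_b) ** 2 for b in baseline)
    ss_res = sum(e * e for e in errors)
    # R-squared is undefined when the reference column is constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot != 0 else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical feature with two missing entries, each imputed as 5.0.
original = [4.0, None, 6.0, None]
imputed = [4.0, 5.0, 6.0, 5.0]
print(residual_metrics(original, imputed))
```
</preformat>
<p>Because only the imputed positions differ from the zero-filled reference, a feature with no missing values yields an error of zero under this definition.</p>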
<p>Using the method depicted in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, the acceptable imputation strategies are identified based on the DoR. An MAEim of zero indicates that the attribute contains no missing values. The imp datasets produced by the mode imp value, median imp value, and KNN imputer exhibit a reduced error, indicating higher precision. Conversely, under the RMSEim metric, the imp datasets produced by the mode imp value, median imp value, and MICE imputer exhibit a larger residual, implying lower accuracy. Under the R<sup>2</sup>im metric, the imp datasets that demonstrate a larger residual, suggesting greater inaccuracy, are those produced by the mode, median, and random value imputers.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Generic procedure to analyze the degree of residual in the dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JIMH_46995-fig-4.tif"/>
</fig>
<p>The random value imputer is the most effective in determining proximity among the seven imputation approaches. The median imp value and the MICE imputer closely follow it. Regarding the imp error, the mode imp value, median imp value, and KNN imputer have the lowest values, while the random value imputer and MICE imputer are close behind.</p>
</sec>
<sec id="s4">
<label>4</label>
<title>Analysis of Suitable ML Algorithm for the APD Dataset</title>
<p>Autoimmune arthritis illnesses, such as rheumatoid arthritis (RA), psoriatic arthritis, and juvenile arthritis, fall under the category of rheumatic diseases. Information regarding arthritis profiles was gathered from Sri Easwari Computerized Lab in Karaikudi, Tamil Nadu, India, over one year and six months, from February 2021 to August 2022. The updated APD dataset has 24 attributes and 102 data points. The dataset includes patient information for individuals with and without autoimmune arthritis disease. The subsequent study aims to assess the arthritis dataset using ML algorithms, classification techniques, and ensemble approaches to uncover hidden biomarkers. Additionally, the research aims to compare the arthritis dataset with other benchmark datasets to determine if the dataset&#x2019;s characteristics impact the accuracy of the ML model.</p>
<p>The following benchmark disease datasets were used for comparison with the APD dataset: Wisconsin Breast Cancer (WBC) [<xref ref-type="bibr" rid="ref-20">20</xref>&#x2013;<xref ref-type="bibr" rid="ref-22">22</xref>], cardiovascular disease (CVD) [<xref ref-type="bibr" rid="ref-23">23</xref>&#x2013;<xref ref-type="bibr" rid="ref-27">27</xref>], Pima Indians Diabetes Mellitus (PIMA) [<xref ref-type="bibr" rid="ref-28">28</xref>&#x2013;<xref ref-type="bibr" rid="ref-30">30</xref>], chronic kidney disease (CKD) [<xref ref-type="bibr" rid="ref-31">31</xref>&#x2013;<xref ref-type="bibr" rid="ref-34">34</xref>] and RA dataset [<xref ref-type="bibr" rid="ref-35">35</xref>].</p>
<p>The ML algorithms used on the APD dataset and other benchmark datasets include Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forest (RndF), and Extreme Gradient Boosting (XGB) [<xref ref-type="bibr" rid="ref-36">36</xref>]. Only the default hyperparameters are used to run these ML models, as listed in <xref ref-type="table" rid="table-2">Table 2</xref>. These ML models were applied to the different disease datasets under various HO and CV techniques to investigate whether dataset characteristics influence prediction accuracy. The dataset&#x2019;s essential characteristics are the attribute size, instance size, and categorical and numerical data types.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Default hyperparameters used for the machine learning models</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>S. No.</th>
<th>Machine learning models</th>
<th>Default hyperparameters</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>LR</td>
<td>&#x2018;penalty&#x2019;: &#x2018;l2&#x2019;,<break/>&#x2018;C&#x2019;: 1.0,<break/>&#x2018;solver&#x2019;: &#x2018;lbfgs&#x2019;</td>
</tr>
<tr>
<td>2</td>
<td>KNN</td>
<td>&#x2018;n_neighbors&#x2019;: 5,<break/>&#x2018;leaf_size&#x2019;: 30,<break/>&#x2018;metric&#x2019;: &#x2018;minkowski&#x2019;</td>
</tr>
<tr>
<td>3</td>
<td>SVM</td>
<td>&#x2018;C&#x2019;: 1.0,<break/>&#x2018;kernel&#x2019;: &#x2018;rbf&#x2019;,<break/>&#x2018;gamma&#x2019;: &#x2018;scale&#x2019;</td>
</tr>
<tr>
<td>4</td>
<td>RndF</td>
<td>&#x2018;n_estimators&#x2019;: 100,<break/>&#x2018;criterion&#x2019;: &#x2018;gini&#x2019;,<break/>&#x2018;min_samples_split&#x2019;: 2</td>
</tr>
<tr>
<td>5</td>
<td>XGB</td>
<td>&#x2018;max_depth&#x2019;: 3,<break/>&#x2018;objective&#x2019;: &#x2018;binary:logistic&#x2019;,<break/>&#x2018;eval_metric&#x2019;: None</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The dataset is initially split into training and test data sets using the HO method. The training data is used to train ML models, while the test data is used to evaluate the trained model. However, relying on a single HO set makes it challenging to assess the representativeness of the training and test data and the overall stability of the model. To overcome this limitation, the datasets were split into training and testing sets in three different proportions: 4:1 (20% test data), 7:3 (30% test data), and 3:2 (40% test data). However, these splits have their own limitations, such as the potential presence of all positive classes in the training dataset, possible dependencies between the training and test datasets, and uneven distribution of data for training and testing.</p>
<p>Cross-validation is used to overcome these constraints. In CV, the dataset is divided into k equal-sized sections called folds. In each round, one fold is used as the test set, while the remaining k-1 folds are used as the training set. This ensures that each fold is used for testing exactly once, preventing any overlap in the test data. By adjusting the number of folds k, it is possible to have similar likelihoods of positive or negative classes. Three fold counts are commonly used to address this: 3-fold CV, 5-fold CV, and 10-fold CV.</p>
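<p>The hold-out proportions and the k-fold scheme can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the code used in the study: the helper names <monospace>hold_out</monospace> and <monospace>k_fold</monospace> are hypothetical, the split is unstratified, and the test-set size is rounded up.</p>
<preformat>
```python
import math
import random


def hold_out(n_samples, test_fraction, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the test-set size is rounded up.
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(n_samples, k, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


n = 102  # data points reported for the updated APD dataset
for fraction in (0.2, 0.3, 0.4):
    train, test = hold_out(n, fraction)
    print("hold-out", fraction, len(train), len(test))
for k in (3, 5, 10):
    sizes = [len(test) for _, test in k_fold(n, k)]
    print("k-fold", k, sizes)
```
</preformat>
<p>With only 102 data points, each test partition contains a few dozen records at most, so a single misclassified record shifts the reported accuracy by several percentage points.</p>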
<p>The APD, WBC, CVD, PIMA, CKD, and RA datasets were collected from multiple sources for use in the proposed work. Once the data had been collected, its format was validated to ensure that ML methods could be applied. Consequently, all disease datasets underwent the required preprocessing stages. The next step follows the fundamental processes of feature engineering, such as addressing missing values using imputation methods and encoding categorical data with one-hot encoding. The preprocessed disease datasets are then used with ML classification algorithms and ensemble methods. The final outcome is determined by evaluating the accuracy of the predictions and the CV score produced by the machine learning algorithms.</p>
<p>The accuracy values for different disease datasets, obtained through various HO and CV methods, are presented in <xref ref-type="table" rid="table-3">Table 3</xref>. The suitable ML model identified for the APD dataset is XGB, with accuracies of 90.48%, 87.1% and 87.8% for the 20%, 30% and 40% HO test splits, and 95.1%, 97.1% and 97% for 3-, 5- and 10-fold CV, respectively. Classification accuracy was also evaluated with the same HO and CV methods for the remaining datasets: CKD, CVD, PIMA, RA, and WBC. These classification accuracies are determined for different datasets to discover whether the dataset characteristics influence the ML model&#x2019;s performance.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Accuracy values for different disease datasets using various hold-out and cross-validation methods</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th rowspan="2">Disease dataset</th>
<th rowspan="2">ML algorithm</th>
<th colspan="3">Hold-out method (Accuracy in %)</th>
<th colspan="3">Cross-validation (Accuracy in %)</th>
</tr>
<tr>
<th>20% of test data</th>
<th>30% of test data</th>
<th>40% of test data</th>
<th>3 Folds</th>
<th>5 Folds</th>
<th>10 Folds</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" rowspan="5">APD</td>
<td>LR</td>
<td>57.14</td>
<td>58.06</td>
<td>65.85</td>
<td>70.59</td>
<td>67.48</td>
<td>66.55</td>
</tr>
<tr>
<td>KNN</td>
<td>66.67</td>
<td>58.06</td>
<td>65.85</td>
<td>62.75</td>
<td>60.71</td>
<td>59.82</td>
</tr>
<tr>
<td>SVM</td>
<td>52.38</td>
<td>45.16</td>
<td>46.34</td>
<td>50.98</td>
<td>52.9</td>
<td>52</td>
</tr>
<tr>
<td>RndF</td>
<td>71.43</td>
<td>74.19</td>
<td>80.49</td>
<td>86.27</td>
<td>88.19</td>
<td>88.36</td>
</tr>
<tr>
<td>XGB</td>
<td><bold>90.48</bold></td>
<td><bold>87.1</bold></td>
<td><bold>87.8</bold></td>
<td><bold>95.1</bold></td>
<td><bold>97.1</bold></td>
<td><bold>97</bold></td>
</tr>
<tr>
<td align="center" rowspan="5">CKD</td>
<td>LR</td>
<td>87.5</td>
<td>89.17</td>
<td>86.25</td>
<td>87.25</td>
<td>87.75</td>
<td>88.75</td>
</tr>
<tr>
<td>KNN</td>
<td>63.75</td>
<td>62.5</td>
<td>60</td>
<td>59</td>
<td>63.25</td>
<td>61.75</td>
</tr>
<tr>
<td>SVM</td>
<td>60</td>
<td>58.33</td>
<td>58.13</td>
<td>62.5</td>
<td>62.5</td>
<td>62.5</td>
</tr>
<tr>
<td>RndF</td>
<td><bold>95</bold></td>
<td><bold>95.83</bold></td>
<td><bold>96.88</bold></td>
<td><bold>98.25</bold></td>
<td><bold>97.75</bold></td>
<td><bold>98</bold></td>
</tr>
<tr>
<td>XGB</td>
<td>93.75</td>
<td>95</td>
<td>94.38</td>
<td>96.25</td>
<td>97</td>
<td>96.75</td>
</tr>
<tr>
<td align="center" rowspan="5">CVD</td>
<td>LR</td>
<td>72.19</td>
<td>71.77</td>
<td>71.74</td>
<td>71.71</td>
<td>71.79</td>
<td>71.9</td>
</tr>
<tr>
<td>KNN</td>
<td>69.25</td>
<td>68.45</td>
<td>68.64</td>
<td>68.99</td>
<td>68.81</td>
<td>68.84</td>
</tr>
<tr>
<td>SVM</td>
<td>72.23</td>
<td>71.71</td>
<td>71.86</td>
<td>71.83</td>
<td>71.86</td>
<td>71.86</td>
</tr>
<tr>
<td>RndF</td>
<td>71.11</td>
<td>71</td>
<td>71.2</td>
<td>71.19</td>
<td>71.02</td>
<td>70.9</td>
</tr>
<tr>
<td>XGB</td>
<td><bold>73.69</bold></td>
<td><bold>73.17</bold></td>
<td><bold>73</bold></td>
<td><bold>73.07</bold></td>
<td><bold>73.21</bold></td>
<td><bold>73.34</bold></td>
</tr>
<tr>
<td align="center" rowspan="5">PIMA</td>
<td>LR</td>
<td><bold>77.92</bold></td>
<td><bold>77.92</bold></td>
<td><bold>76.95</bold></td>
<td><bold>77.47</bold></td>
<td><bold>78</bold></td>
<td><bold>77.09</bold></td>
</tr>
<tr>
<td>KNN</td>
<td>68.83</td>
<td>67.97</td>
<td>66.88</td>
<td>70.96</td>
<td>72.14</td>
<td>72.52</td>
</tr>
<tr>
<td>SVM</td>
<td>75.97</td>
<td>71.43</td>
<td>74.03</td>
<td>75.13</td>
<td>75.65</td>
<td>75.65</td>
</tr>
<tr>
<td>RndF</td>
<td>75.32</td>
<td>76.19</td>
<td>72.4</td>
<td>76.95</td>
<td>76.7</td>
<td>74.48</td>
</tr>
<tr>
<td>XGB</td>
<td>75.32</td>
<td>70.13</td>
<td>71.1</td>
<td>76.17</td>
<td>74.23</td>
<td>73.05</td>
</tr>
<tr>
<td align="center" rowspan="5">RA</td>
<td>LR</td>
<td>66.67</td>
<td>72.22</td>
<td>75</td>
<td><bold>61.67</bold></td>
<td><bold>76.67</bold></td>
<td><bold>80</bold></td>
</tr>
<tr>
<td>KNN</td>
<td><bold>83.33</bold></td>
<td><bold>83.33</bold></td>
<td><bold>87.5</bold></td>
<td>56.67</td>
<td>61.67</td>
<td>66.67</td>
</tr>
<tr>
<td>SVM</td>
<td>75</td>
<td>61.11</td>
<td>79.17</td>
<td>58.33</td>
<td>60</td>
<td>66.67</td>
</tr>
<tr>
<td>RndF</td>
<td>75</td>
<td>72.22</td>
<td>75</td>
<td>55</td>
<td>65</td>
<td>70</td>
</tr>
<tr>
<td>XGB</td>
<td>66.67</td>
<td>72.22</td>
<td>62.5</td>
<td>58.33</td>
<td>68.33</td>
<td>73.33</td>
</tr>
<tr>
<td align="center" rowspan="5">WBC</td>
<td>LR</td>
<td>98.25</td>
<td>96.49</td>
<td>96.05</td>
<td>94.55</td>
<td>93.85</td>
<td>94.38</td>
</tr>
<tr>
<td>KNN</td>
<td>96.49</td>
<td>92.98</td>
<td>93.86</td>
<td>92.27</td>
<td>92.79</td>
<td>92.98</td>
</tr>
<tr>
<td>SVM</td>
<td>92.98</td>
<td>90.06</td>
<td>91.67</td>
<td>91.04</td>
<td>91.22</td>
<td>91.39</td>
</tr>
<tr>
<td>RndF</td>
<td><bold>99.12</bold></td>
<td><bold>98.25</bold></td>
<td><bold>97.81</bold></td>
<td>95.96</td>
<td>95.78</td>
<td>95.61</td>
</tr>
<tr>
<td>XGB</td>
<td>96.99</td>
<td>97.08</td>
<td>97.81</td>
<td><bold>96.66</bold></td>
<td><bold>97.72</bold></td>
<td><bold>97.89</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Diabetes is a well-known autoimmune disease. The PIMA dataset consists of 768 instances with eight predictor variables and one outcome variable. Logistic regression (LR) is the most effective ML model for this dataset, which is attributed to the linear relationship among its attributes: LR achieved the highest accuracy among the ML models, above 75% for all HO methods and CV folds. The K-nearest neighbors (KNN) algorithm performs effectively on datasets that are small and have a limited number of features. The RA dataset has 60 data points with eight independent variables and one dependent variable. For this dataset, KNN shows the highest accuracy, above 80%, across the HO methods employed. In addition, the CV accuracy of the KNN model on the RA dataset increases as the number of folds increases.</p>
<p>The WBC dataset has 569 instances with 30 predictor variables and one response variable. RndF achieves the highest accuracy, above 97%, for all HO methods, while XGB achieves the highest accuracy, above 96%, for all CV methods. The CKD dataset has 400 instances with 24 predictor variables and one response variable; RndF achieves the highest accuracy on it, 95% or above, in both HO and CV methods. The CVD dataset has 70000 samples with 12 independent variables and one dependent variable; the XGB ensemble technique is the most suitable ML model for it, with accuracy of 73% or above for both HO and CV methods.</p>
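<p>The HO and CV protocols compared above can be sketched as follows. This is a minimal, illustrative sketch rather than the authors' pipeline: it uses a small hand-written KNN classifier and synthetic data with the same shape as the RA dataset (60 points, eight predictors). The helper names (<monospace>holdout_accuracy</monospace>, <monospace>kfold_accuracy</monospace>, <monospace>make_toy_data</monospace>), the test shares of 20%, 30%, and 40%, and the fold counts of 3, 5, and 10 are assumptions chosen for illustration.</p>
<code language="python">
```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    # Rank training points by squared Euclidean distance to x, then majority-vote.
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_idx, test_idx, X, y):
    # Percentage of test points whose predicted label matches the true label.
    train_X = [X[i] for i in train_idx]
    train_y = [y[i] for i in train_idx]
    hits = sum(knn_predict(train_X, train_y, X[i]) == y[i] for i in test_idx)
    return 100.0 * hits / len(test_idx)


def holdout_accuracy(X, y, test_share, seed=0):
    # Hold-out (HO): one shuffled split into a test part and a training part.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = round(test_share * len(X))
    return accuracy(idx[n_test:], idx[:n_test], X, y)


def kfold_accuracy(X, y, folds, seed=0):
    # k-fold CV: each fold is the test set once; the scores are averaged.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        test_idx = idx[f::folds]  # every folds-th shuffled index forms one fold
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        scores.append(accuracy(train_idx, test_idx, X, y))
    return sum(scores) / folds


def make_toy_data(n=60, seed=1):
    # Synthetic two-class data (hypothetical): 8 Gaussian features per point.
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(2.0 * label, 1.0) for _ in range(8)])
        y.append(label)
    return X, y


X, y = make_toy_data()
for share in (0.2, 0.3, 0.4):
    print(f"HO test share {share:.0%}: {holdout_accuracy(X, y, share):.2f}%")
for folds in (3, 5, 10):
    print(f"{folds}-fold CV: {kfold_accuracy(X, y, folds):.2f}%")
```
</code>
<p>Because each fold in k-fold CV serves once as the test set, every sample contributes to the averaged score, whereas an HO estimate depends on a single split. On a 60-point dataset, an HO score is therefore more sensitive to which points happen to fall in the test set.</p>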
<p>The dataset&#x2019;s characteristics greatly influence the accuracy score. A prime example is the WBC (breast cancer) dataset, which consists of over 500 data points and 30 predictor attributes, all continuous numeric values, with no missing values. Consequently, it attains the highest accuracy scores among the datasets studied.</p>
<p>The analysis of the autoimmune arthritis disease dataset identifies the ML model most suited for predicting, from new patient data, whether or not a patient has autoimmune arthritis. The study also uncovers biomarkers hidden within the arthritis dataset. In addition, the arthritis dataset is compared with several benchmark datasets to determine whether a dataset&#x2019;s properties influence the accuracy of the ML model [<xref ref-type="bibr" rid="ref-37">37</xref>]. Arthritis Profile Data identifies ESR, ASO, CRP, RF, L, Abs, Uric Acid, RBC, and TC as significant biomarkers. Of these, CRP, ESR, and RF have previously been reported as significant biomarkers for autoimmune arthritis [<xref ref-type="bibr" rid="ref-38">38</xref>]. Beyond these established biomarkers, the significant hidden biomarkers discovered are ASO [<xref ref-type="bibr" rid="ref-39">39</xref>], L, Abs, Uric Acid, RBC, and TC. Our empirical evidence indicates that the XGB ensemble approach provides the highest accuracy for Arthritis Profile Data.</p>
<p>Among the five ML models, XGB achieved the highest accuracy across the HO and CV methods for the APD dataset. Identifying the important features in a dataset is essential [<xref ref-type="bibr" rid="ref-40">40</xref>]. Feature importance values were calculated using XGB, and the top ten features are displayed in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. The five highest feature importance values are 0.32, 0.27, 0.12, 0.10, and 0.03, for the features ESRo, ASO, CRP, RF, and L, respectively; ESRo holds the highest value.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Significant biomarkers identified in the APD dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JIMH_46995-fig-5.tif"/>
</fig>
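<p>The ranking step behind the feature importance figure can be illustrated with a short sketch. It takes only the five importance values reported in the text; in practice the values would come from the fitted model. The function name <monospace>rank_features</monospace> and the cumulative column are assumptions added for illustration and are not part of the original analysis.</p>
<code language="python">
```python
# Importance values for the five leading APD features, as reported in the text.
reported_importance = {
    "ESRo": 0.32,
    "ASO": 0.27,
    "CRP": 0.12,
    "RF": 0.10,
    "L": 0.03,
}


def rank_features(importance, top_k=10):
    """Return (feature, value, cumulative value) tuples, highest value first."""
    ordered = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    ranked, running = [], 0.0
    for name, value in ordered[:top_k]:
        running += value
        ranked.append((name, value, round(running, 2)))
    return ranked


for name, value, cumulative in rank_features(reported_importance):
    print(f"{name:5s} {value:.2f} (cumulative {cumulative:.2f})")
```
</code>
<p>The five reported values add up to 0.84, so if the importances are normalized to sum to one, the remaining features together account for about 0.16.</p>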
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion and Future Scope</title>
<p>Analyzing real-time clinical laboratory data is a crucial task. The APD dataset consists of retrospective clinical data collected from the laboratory. Missing data in the APD dataset were handled with suitable imputation techniques. Based on the DoP and residual, the random and mode value imputers were identified as the optimal ML imputation algorithms for the APD dataset. The most suitable ML algorithm for the APD dataset is XGB, which indicates that XGB is an effective ensemble technique for small, high-dimensional datasets. The highest accuracy obtained on the APD dataset with the HO method is 90.48%, using 80% of the data for training and 20% for testing. Moreover, with five-fold CV, XGB obtained an accuracy of 97.1%. Out of 24 occurrences of the APD dataset, only six significant hidden biomarkers were identified: ASO, L, Abs, Uric_Acid, RBC, and TC. Our empirical investigation demonstrates that dataset properties significantly impact the performance of ML models. Furthermore, the HO split and the number of CV folds also considerably affect the accuracy of the ML algorithms. In the results reported here, neither a larger HO test set nor a larger number of CV folds changed accuracy in one consistent direction; the effect varied across datasets and models. Finally, as future work, we plan to optimize the accuracy of XGB using metaheuristic optimization algorithms.</p>
</sec>
</body>
<back>
<ack>
<p>We extend our sincere gratitude to the Editor of the Journal of Intelligent Medicine and Healthcare for their invaluable guidance. We also thank the anonymous reviewers, whose insights strengthened the quality of the manuscript.</p>
</ack>
<sec><title>Funding Statement</title>
<p>The authors thank the Department of Science and Technology, New Delhi, for general financial support and for infrastructure facilities sponsored under the PURSE 2<sup>nd</sup> Phase Programme (Order No. SR/PURSE Phase 2/38 (G), dated 21.02.2017). This work is also supported by the RUSA Phase 2.0 (II Installment) grant sanctioned vide Letter No. F. 24-51/2014-U, Policy (TN Multi-Gen), Department of Higher Education, Government of India, dated 09.10.2018.</p>
</sec>
<sec><title>Author Contributions</title>
<p>The authors confirm their contribution to the paper as follows: study conception and design, data collection, analysis, and interpretation of results: R. Uma; draft manuscript preparation: S. Santhoshkumar. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>Kidney Disease Dataset: <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/mansoordaku/ckdisease">https://www.kaggle.com/datasets/mansoordaku/ckdisease</ext-link>. Cardiovascular Disease Dataset: <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/code/sakakafayat/cardiovascular-disease-dataset/data">https://www.kaggle.com/code/sakakafayat/cardiovascular-disease-dataset/data</ext-link>. Breast Cancer Dataset: <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data">https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data</ext-link>. Diabetes Dataset: <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database">https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database</ext-link>. RA Dataset: Data are available on request from the authors. APD Dataset: Data are available on request from the authors.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>1.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Runge</surname></string-name></person-group>, &#x201C;<article-title>Rheumatology fellowship curriculum</article-title>,&#x201D; <comment>Ph.D. dissertation, Upstate Medical University, USA</comment>, <year>2001</year>.</mixed-citation></ref>
<ref id="ref-2"><label>2.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Bossuyt</surname></string-name>, <string-name><given-names>E.</given-names> <surname>de Langhe</surname></string-name>, <string-name><given-names>M. O.</given-names> <surname>Borghi</surname></string-name> and <string-name><given-names>P. L.</given-names> <surname>Meroni</surname></string-name></person-group>, &#x201C;<article-title>Understanding and interpreting antinuclear antibody tests in systemic rheumatic diseases</article-title>,&#x201D; <source>Nature Reviews Rheumatology</source>, vol. <volume>16</volume>, no. <issue>12</issue>, pp. <fpage>715</fpage>&#x2013;<lpage>726</lpage>, <year>2020</year>; <pub-id pub-id-type="pmid">33154583</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>3.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. A.</given-names> <surname>van Delft</surname></string-name> and <string-name><given-names>T. W.</given-names> <surname>Huizinga</surname></string-name></person-group>, &#x201C;<article-title>An overview of autoantibodies in rheumatoid arthritis</article-title>,&#x201D; <source>Journal of Autoimmunity</source>, vol. <volume>110</volume>, pp. <fpage>102392</fpage>, <year>2020</year>; <pub-id pub-id-type="pmid">31911013</pub-id></mixed-citation></ref>
<ref id="ref-4"><label>4.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z. X.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>J. S.</given-names> <surname>Miller</surname></string-name> and <string-name><given-names>S. G.</given-names> <surname>Zheng</surname></string-name></person-group>, &#x201C;<article-title>An updated advance of autoantibodies in autoimmune diseases</article-title>,&#x201D; <source>Autoimmunity Reviews</source>, vol. <volume>20</volume>, no. <issue>2</issue>, pp. <fpage>102743</fpage>, <year>2021</year>; <pub-id pub-id-type="pmid">33333232</pub-id></mixed-citation></ref>
<ref id="ref-5"><label>5.</label><mixed-citation publication-type="other">&#x201C;<article-title>Category: Deaths from autoimmune disease</article-title>,&#x201D; <comment>Wikipedia: The free encyclopedia. Wikimedia Foundation, Inc., [Online]</comment>. Available: <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/wiki/Category:Deaths_from_autoimmune_disease">https://en.wikipedia.org/wiki/Category:Deaths_from_autoimmune_disease</ext-link> (<comment>accessed on 10/12/2022</comment>)</mixed-citation></ref>
<ref id="ref-6"><label>6.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. G.</given-names> <surname>Chancay</surname></string-name>, <string-name><given-names>S. N.</given-names> <surname>Guendsechadze</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Blanco</surname></string-name></person-group>, &#x201C;<article-title>Types of pain and their psychosocial impact in women with rheumatoid arthritis</article-title>,&#x201D; <source>Women&#x2019;s Midlife Health</source>, vol. <volume>5</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>9</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-7"><label>7.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. M.</given-names> <surname>Seong</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yee</surname></string-name> and <string-name><given-names>H. S.</given-names> <surname>Gwak</surname></string-name></person-group>, &#x201C;<article-title>Dipeptidyl peptidase-4 inhibitors lower the risk of autoimmune disease in patients with type 2 diabetes mellitus: A nationwide population-based cohort study</article-title>,&#x201D; <source>British Journal of Clinical Pharmacology</source>, vol. <volume>85</volume>, no. <issue>8</issue>, pp. <fpage>1719</fpage>&#x2013;<lpage>1727</lpage>, <year>2019</year>; <pub-id pub-id-type="pmid">30964554</pub-id></mixed-citation></ref>
<ref id="ref-8"><label>8.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Deepack</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Ishmeet</surname></string-name></person-group>, &#x201C;<article-title>Artificial intelligence, machine learning and deep learning: Definitions and differences</article-title>,&#x201D; <source>Clinical and Experimental Dermatology</source>, vol. <volume>45</volume>, no. <issue>1</issue>, pp. <fpage>131</fpage>&#x2013;<lpage>132</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-9"><label>9.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>I. S.</given-names> <surname>Stafford</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Kellermann</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Mossotto</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Mark Beattie</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Ben</surname></string-name> <etal>et al.</etal></person-group> &#x201C;<article-title>A systematic review of the applications of artificial intelligence and machine learning in autoimmune diseases</article-title>,&#x201D; <source>npj Digital Medicine</source>, vol. <volume>3</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>11</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-10"><label>10.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Uma</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Santhoshkumar</surname></string-name></person-group>, &#x201C;<article-title>Analysis of suitable machine learning imputation techniques for arthritis profile data</article-title>,&#x201D; <source>IETE Journal of Research</source>, pp. <fpage>1</fpage>&#x2013;<lpage>22</lpage>, <year>2022</year>. <pub-id pub-id-type="doi">10.1080/03772063.2022.2120914</pub-id></mixed-citation></ref>
<ref id="ref-11"><label>11.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Shi</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Shi</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Wang</surname></string-name> <etal>et al.</etal></person-group> &#x201C;<article-title>Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19</article-title>,&#x201D; <source>IEEE Reviews in Biomedical Engineering</source>, vol. <volume>14</volume>, pp. <fpage>4</fpage>&#x2013;<lpage>15</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-12"><label>12.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Zebari</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Abdulazeez</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Zeebaree</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Zebari</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Saeed</surname></string-name></person-group>, &#x201C;<article-title>A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction</article-title>,&#x201D; <source>Journal of Applied Science and Technology Trends</source>, vol. <volume>1</volume>, no. <issue>2</issue>, pp. <fpage>56</fpage>&#x2013;<lpage>70</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-13"><label>13.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Fan</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Song</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Deep learning-based feature engineering methods for improved building energy prediction</article-title>,&#x201D; <source>Applied Energy</source>, vol. <volume>240</volume>, pp. <fpage>35</fpage>&#x2013;<lpage>45</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-14"><label>14.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. A. H.</given-names> <surname>Abas</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Ismail</surname></string-name>, <string-name><given-names>N. A.</given-names> <surname>Ali</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Tajuddin</surname></string-name> and <string-name><given-names>N. M.</given-names> <surname>Tahir</surname></string-name></person-group>, &#x201C;<article-title>Agarwood oil quality classification using support vector classifier and grid search cross validation hyperparameter tuning</article-title>,&#x201D; <source>International Journal of Emerging Trends in Engineering Research</source>, vol. <volume>8</volume>, no. <issue>6</issue>, pp. <fpage>2551</fpage>&#x2013;<lpage>2556</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-15"><label>15.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Seu</surname></string-name>, <string-name><given-names>M. S.</given-names> <surname>Kang</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>An intelligent missing data imputation techniques: A review</article-title>,&#x201D; <source>JOIV: International Journal on Informatics Visualization</source>, vol. <volume>6</volume>, no. <issue>1&#x2013;2</issue>, pp. <fpage>278</fpage>&#x2013;<lpage>283</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-16"><label>16.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Poulos</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Valle</surname></string-name></person-group>, &#x201C;<article-title>Missing data imputation for supervised learning</article-title>,&#x201D; <source>Applied Artificial Intelligence</source>, vol. <volume>32</volume>, no. <issue>2</issue>, pp. <fpage>186</fpage>&#x2013;<lpage>196</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-17"><label>17.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Johny</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Philip</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Augustine</surname></string-name></person-group>, &#x201C;<article-title>Methods to handle incomplete data</article-title>,&#x201D; <source>MAMC Journal of Medical Sciences</source>, vol. <volume>6</volume>, no. <issue>3</issue>, pp. <fpage>194</fpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-18"><label>18.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Woznica</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Biecek</surname></string-name></person-group>, &#x201C;<article-title>Does imputation matter? Benchmark for predictive models</article-title>,&#x201D; <comment>arXiv preprint arXiv:2007.02837</comment>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-19"><label>19.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Sundararajan</surname></string-name> and <string-name><given-names>A. I.</given-names> <surname>Sarwat</surname></string-name></person-group>, &#x201C;<article-title>Evaluation of missing data imputation methods for an enhanced distributed PV generation prediction</article-title>,&#x201D; in <conf-name>Proc. of the Future Technologies Conf.</conf-name>, <publisher-name>Springer International Publishing</publisher-name>, pp. <fpage>590</fpage>&#x2013;<lpage>609</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-20"><label>20.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Rasool</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Bunterngchit</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Tiejian</surname></string-name>, <string-name><given-names>M. R.</given-names> <surname>Islam</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Qu</surname></string-name> <etal>et al.</etal></person-group> &#x201C;<article-title>Improved machine learning-based predictive models for breast cancer diagnosis</article-title>,&#x201D; <source>International Journal of Environmental Research and Public Health</source>, vol. <volume>19</volume>, no. <issue>6</issue>, pp. <fpage>3211</fpage>, <year>2022</year>; <pub-id pub-id-type="pmid">35328897</pub-id></mixed-citation></ref>
<ref id="ref-21"><label>21.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Shinde</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kenchappagol</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Mishra</surname></string-name></person-group>, &#x201C;<article-title>Comparative study of machine learning algorithms for breast cancer classification</article-title>,&#x201D; <source>Intelligent Cloud Computing Smart Innovation Systems and Technologies</source>, vol. <volume>286</volume>, pp. <fpage>545</fpage>&#x2013;<lpage>554</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-22"><label>22.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Mushtaq</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Yaqub</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Sani</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Khalid</surname></string-name></person-group>, &#x201C;<article-title>Effective K-nearest neighbor classifications for Wisconsin breast cancer data sets</article-title>,&#x201D; <source>Journal of the Chinese Institute of Engineers</source>, vol. <volume>43</volume>, no. <issue>1</issue>, pp. <fpage>80</fpage>&#x2013;<lpage>92</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-23"><label>23.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Shorewala</surname></string-name></person-group>, &#x201C;<article-title>Early detection of coronary heart disease using ensemble techniques</article-title>,&#x201D; <source>Informatics in Medicine Unlocked</source>, vol. <volume>26</volume>, pp. <fpage>100655</fpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-24"><label>24.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. N.</given-names> <surname>Uddin</surname></string-name> and <string-name><given-names>R. K.</given-names> <surname>Haider</surname></string-name></person-group>, &#x201C;<article-title>An ensemble method based multilayer dynamic system to predict cardiovascular disease using machine learning approach</article-title>,&#x201D; <source>Informatics in Medicine Unlocked</source>, vol. <volume>24</volume>, no. <issue>1</issue>, pp. <fpage>100584</fpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-25"><label>25.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Hagan</surname></string-name>, <string-name><given-names>C. J.</given-names> <surname>Gillan</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Mallett</surname></string-name></person-group>, &#x201C;<article-title>Comparison of machine learning methods for the classification of cardiovascular disease</article-title>,&#x201D; <source>Informatics in Medicine Unlocked</source>, vol. <volume>24</volume>, pp. <fpage>100606</fpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-26"><label>26.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N. A.</given-names> <surname>Baghdadi</surname></string-name>, <string-name><given-names>S. M.</given-names> <surname>Farghaly Abdelaliem</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Malki</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Gad</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Ewis</surname></string-name> <etal>et al.</etal></person-group> &#x201C;<article-title>Advanced machine learning techniques for cardiovascular disease early detection and diagnosis</article-title>,&#x201D; <source>Journal of Big Data</source>, vol. <volume>10</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>29</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-27"><label>27.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>U.</given-names> <surname>Banu</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Vanjerkhede</surname></string-name></person-group>, &#x201C;<article-title>Hybrid feature extraction and infinite feature selection based diagnosis for cardiovascular disease related to smoking habit</article-title>,&#x201D; <source>International Journal on Advanced Science, Engineering &#x0026; Information Technology</source>, vol. <volume>13</volume>, no. <issue>2</issue>, pp. <fpage>578</fpage>&#x2013;<lpage>584</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-28"><label>28.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B. P.</given-names> <surname>Kumar</surname></string-name></person-group>, &#x201C;<article-title>Diabetes prediction and comparative analysis using machine learning algorithms</article-title>,&#x201D; <source>International Research Journal of Modernization in Engineering Technology and Science</source>, vol. <volume>4</volume>, no. <issue>5</issue>, pp. <fpage>4688</fpage>&#x2013;<lpage>4696</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-29"><label>29.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Bailey</surname></string-name>, <string-name><given-names>Q. A.</given-names> <surname>Xu</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms</article-title>,&#x201D; <source>Neural Computing Applications</source>, vol. <volume>35</volume>, no. <issue>22</issue>, pp. <fpage>16157</fpage>&#x2013;<lpage>16173</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-30"><label>30.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Elias</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Maria</surname></string-name></person-group>, &#x201C;<article-title>Data-driven machine-learning methods for diabetes risk prediction</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>22</volume>, no. <issue>14</issue>, pp. <fpage>5304</fpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-31"><label>31.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>E. M.</given-names> <surname>Senan</surname></string-name>, <string-name><given-names>M. H.</given-names> <surname>Al-Adhaileh</surname></string-name>, <string-name><given-names>F. W.</given-names> <surname>Alsaade</surname></string-name>, <string-name><given-names>T. H.</given-names> <surname>Aldhyani</surname></string-name>, <string-name><given-names>A. A.</given-names> <surname>Alqarni</surname></string-name> <etal>et al.</etal></person-group> &#x201C;<article-title>Diagnosis of disease using effective classification algorithms and recursive feature elimination techniques</article-title>,&#x201D; <source>Journal of Healthcare Engineering</source>, vol. <volume>2021</volume>, pp. <fpage>1004767</fpage>, <year>2021</year>; <pub-id pub-id-type="pmid">34211680</pub-id></mixed-citation></ref>
<ref id="ref-32"><label>32.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Chaurasia</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Pandey</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Pal</surname></string-name></person-group>, &#x201C;<article-title>Chronic kidney disease: A prediction and comparison of ensemble and basic classifiers performance</article-title>,&#x201D; <source>Human-Intelligent Systems Integration</source>, vol. <volume>4</volume>, no. <issue>1&#x2013;2</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>10</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-33"><label>33.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Tekale</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Shingavi</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Wandhekar</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Chatorikar</surname></string-name></person-group>, &#x201C;<article-title>Prediction of chronic kidney disease using machine learning algorithm</article-title>,&#x201D; <source>International Journal of Advanced Research in Computer and Communication Engineering</source>, vol. <volume>7</volume>, no. <issue>10</issue>, pp. <fpage>92</fpage>&#x2013;<lpage>96</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-34"><label>34.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Majid</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Gulzar</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ayoub</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Khan</surname></string-name></person-group>, &#x201C;<article-title>Using ensemble learning and advanced data mining techniques to improve the diagnosis of chronic kidney disease</article-title>,&#x201D; <source>International Journal of Advanced Computer Science and Applications</source>, vol. <volume>14</volume>, no. <issue>10</issue>, pp. <fpage>470</fpage>&#x2013;<lpage>480</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-35"><label>35.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>U.</given-names> <surname>Ramasamy</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Sundar</surname></string-name></person-group>, &#x201C;<article-title>An illustration of rheumatoid arthritis disease using decision tree algorithm</article-title>,&#x201D; <source>Informatica</source>, vol. <volume>46</volume>, no. <issue>1</issue>, pp. <fpage>109</fpage>&#x2013;<lpage>119</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-36"><label>36.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Xin</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>A diabetes prediction model based on Boruta feature selection and ensemble learning</article-title>,&#x201D; <source>BMC Bioinformatics</source>, vol. <volume>24</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>34</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-37"><label>37.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Al-Tawil</surname></string-name>, <string-name><given-names>B. A.</given-names> <surname>Mahafzah</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Al Tawil</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Aljarah</surname></string-name></person-group>, &#x201C;<article-title>Bio-inspired machine learning approach to type 2 diabetes detection</article-title>,&#x201D; <source>Symmetry</source>, vol. <volume>15</volume>, no. <issue>3</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>16</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-38"><label>38.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Aletaha</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Neogi</surname></string-name>, <string-name><given-names>A. J.</given-names> <surname>Silman</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Funovits</surname></string-name>, <string-name><given-names>D. T.</given-names> <surname>Felson</surname></string-name> <etal>et al.</etal></person-group> &#x201C;<article-title>2010 Rheumatoid arthritis classification criteria: An American College of Rheumatology/European League Against Rheumatism collaborative initiative</article-title>,&#x201D; <source>Arthritis &#x0026; Rheumatism</source>, vol. <volume>62</volume>, no. <issue>9</issue>, pp. <fpage>2569</fpage>&#x2013;<lpage>2581</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-39"><label>39.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R. J.</given-names> <surname>Fran&#x00E7;ois</surname></string-name></person-group>, &#x201C;<article-title>Beta-haemolytic streptococci and antistreptolysin-O titres in patients with rheumatoid arthritis and a matched control group</article-title>,&#x201D; <source>Annals of the Rheumatic Diseases</source>, vol. <volume>24</volume>, no. <issue>4</issue>, pp. <fpage>369</fpage>, <year>1965</year>.</mixed-citation></ref>
<ref id="ref-40"><label>40.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Asa</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Busi</surname></string-name> and <string-name><given-names>J. S.</given-names> <surname>Meka</surname></string-name></person-group>, &#x201C;<article-title>A hybrid deep learning technique for feature selection and classification of chronic kidney disease</article-title>,&#x201D; <source>International Journal of Intelligent Engineering &#x0026; Systems</source>, vol. <volume>16</volume>, no. <issue>6</issue>, pp. <fpage>638</fpage>&#x2013;<lpage>648</lpage>, <year>2023</year>.</mixed-citation></ref>
</ref-list>
</back></article>