Diabetes is a chronic health condition that impairs the body's ability to convert food to energy, recognized by persistently high levels of blood glucose. Undiagnosed diabetes can cause many complications, including retinopathy, nephropathy, neuropathy, and other vascular disorders. Machine learning methods can be very useful for disease identification, prediction, and treatment. This paper proposes a new ensemble learning approach for type 2 diabetes prediction based on a hybrid meta-classifier of fuzzy clustering and logistic regression. The proposed approach consists of two levels. First, a base-learner comprising six machine learning algorithms is utilized for predicting diabetes. Second, a hybrid meta-learner that combines fuzzy clustering and logistic regression is employed to appropriately integrate predictions from the base-learners and provide an accurate prediction of diabetes. The hybrid meta-learner employs the Fuzzy C-means Clustering (FCM) algorithm to generate highly significant clusters of predictions from base-learners. The predictions of base-learners and their fuzzy clusters are then employed as inputs to the Logistic Regression (LR) algorithm, which generates the final diabetes prediction result. Experiments were conducted using two publicly available datasets, the Pima Indians Diabetes Database (PIDD) and the Schorling Diabetes Dataset (SDD) to demonstrate the efficacy of the proposed method for predicting diabetes. When compared with other models, the proposed approach outperformed them and obtained the highest prediction accuracies of 99.00% and 95.20% using the PIDD and SDD datasets, respectively.

Diabetes Mellitus (DM) is a common chronic disease that affects approximately 425 million people worldwide, and this figure is expected to rise to 629 million by 2045 [

Over the last few years, there has been increasing interest in the use of machine learning techniques for the early diagnosis of diabetes to improve classification accuracy. Many classification methods based on machine learning, including Naive Bayes (NB), Neural Networks (NN), Support Vector Machines (SVM), Fuzzy Decision Trees (FDT), K-Nearest Neighbor (K-NN) and Decision Trees (DT), have been used for the diagnosis of diabetes [

Although machine learning approaches for diabetes prediction have been developed recently, existing models rely heavily on a single classifier trained on a single dataset, which is incapable of accurately predicting diabetes. Therefore, this research proposes an ensemble approach based on the principles of the stack method, which combines multiple base models for sample prediction and one meta-model for achieving the ultimate predictions by combining the base-level classifiers. The main goal of this study is to explore whether integrating fuzzy clustering and logistic regression at the meta-learner level improves ensemble stacking efficiency. The proposed ensemble approach utilizes a combination of fuzzy clustering and logistic regression to create a highly accurate diabetes model. The hybrid meta-learner employs the Fuzzy C-means clustering (FCM) algorithm to generate highly significant clusters of predictions from base-learners. The predictions of the base-learners and their fuzzy clusters are then employed as inputs to a logistic regression (LR) algorithm, which generates the final diabetes prediction result. The base-learner model includes six different machine learning algorithms: KNN, radial SVM, NB, linear SVM, NN, and light gradient boosting machine (LightGBM). The proposed approach can be leveraged on health care systems to enhance the accuracy of disease prediction.

Diabetes prediction experiments were used to assess the proposed ensemble method based on five performance measures: accuracy, precision, recall, f1-score, and area under the receiver operating characteristic curve (AUC). The main contributions of this study and its significance for improving type 2 diabetes prediction can be summarized as follows:

Based on the excellent performance of ensemble learning, a hybrid ensemble approach for diabetes prediction is suggested, which can integrate the base learners to create a more efficient learner.

The proposed approach employs a hybrid meta-learner that combines fuzzy clustering and logistic regression to integrate base-learner results appropriately and provide an accurate diabetes prediction.

In order to improve the accuracy of diabetes prediction, the suggested approach uses the fuzzy clustering method to explore the hidden information in base-learner predictions.

The proposed approach outperformed the individual classifiers, soft and hard voting ensemble approaches, and achieved the highest diabetes predictive accuracy of 99% and 95% using two publicly available datasets.

When compared to experimental results from other researchers utilizing the same datasets, the proposed approach achieved superior results.

The remainder of this paper is organized as follows. Section 2 introduces studies related to ensemble learning approaches for diabetes prediction. Section 3 describes the proposed ensemble learning approach for diabetes prediction, which is based on the hybrid meta-classifier of fuzzy clustering and logistic regression. Experimental results and discussion are provided in Section 4. Finally, conclusions and future work are presented in Section 5.

Several ensemble learning techniques for predicting diabetes have recently been formulated. These techniques have received significant attention because they are more efficient than individual learners in achieving high classification accuracy and generalization capacity. Ensemble learning constructs a model from multiple classifiers, which are combined to create a stronger model with better prediction efficiency [

Zolfaghari [

In Singh et al. [

Most existing stacking-based research has a significant issue with the improper integration of the base- and meta-learners. To address this limitation, we propose a hybrid level-1 meta-learner that integrates fuzzy clustering and logistic regression to classify the predictions from level-0 base-learners. As a result, this research presents a stacking-based ensemble learning approach for diabetes prediction, in which the hybrid meta-learner is used for model combination.

The ensemble learning model is a well-known approach for improving performance by combining a group of classifiers [

The overall process flow of the proposed approach is shown in

Step 1: The dataset is preprocessed by removing missing and incorrect values that result from errors or irregularities, and it is then divided into training and testing sets using the 5-fold cross-validation technique.

Step 2: The base learners (KNN, radial SVM, NB, linear SVM, NN and LightGBM) are trained and tested using the training and testing sets.

Step 3: The FCM algorithm is utilized to cluster the prediction probabilities generated by the base learners.

Step 4: The Logistic Regression algorithm takes the prediction probabilities generated by the base learners and their clusters produced by the FCM as inputs, and provides the final diabetes prediction result.

The proposed approach was implemented using the PIDD with type 2 diabetes from the University of California Irvine repository. Each dataset instance consists of a class label and eight features; the class indicates whether the patient has diabetes or not: the class value “1” denotes diabetic cases, and “0” denotes nondiabetic cases. The features used in the dataset are the number of pregnancies, plasma glucose concentration, diabetes pedigree function, triceps skin fold thickness, diastolic blood pressure, 2-h serum insulin, body mass index, and age [

Sr. no. | Independent attributes in Pima Indians dataset | Description of independent attributes |
---|---|---|
1 | Pregnancy | Number of times the participant has been pregnant |
2 | Glucose | Plasma glucose concentration in an oral glucose tolerance test |
3 | Diastolic blood pressure | Diastolic blood pressure (mmHg) |
4 | Skin thickness | Triceps skin fold thickness (mm) |
5 | Serum insulin | 2-h serum insulin (μU/mL) |
6 | BMI | Body mass index (kg/m²) |
7 | Diabetes pedigree function | An attribute used in diabetes prognosis |
8 | Age | Participants’ ages |

Sr. no. | Independent attributes in Schorling dataset | Description of independent attributes |
---|---|---|
1 | Stab.glu | Stabilized glucose (mg/dL) |
2 | Age | Age (years) |
3 | Ratio | Cholesterol/High Density Lipoproteins (HDL) ratio |
4 | Waist | Waist (inches) |
5 | Chol | Total cholesterol (mg/dL) |
6 | Bp. s | Systolic blood pressure (mmHg) |
7 | Bp. d | Diastolic blood pressure (mmHg) |
8 | Frame | A factor with levels (small, medium, large) |
9 | Gender | Gender of subject (male, female) |
10 | Hdl | High density lipoprotein (mg/dL) |
11 | Height | Height (inches) |
12 | Hip | Hip (inches) |
13 | Weight | Weight (pounds) |

The K-fold cross-validation strategy is among the most widely used techniques for model selection and classifier error estimation [

The dataset was split into five mutually exclusive subsamples of equal size. The proposed approach was trained five times. Each time, we used four folds for the training and left one fold for testing. This approach has the advantage of reducing the bias associated with random sampling [
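The fold construction described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation; the function name `k_fold_indices`, the shuffling seed, and the sample count are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split range(n_samples) into k mutually exclusive folds of near-equal size.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    # Fold f takes every k-th shuffled index, starting at offset f.
    folds = [indices[f::k] for f in range(k)]
    splits = []
    for f in range(k):
        test_idx = folds[f]
        # The remaining k - 1 folds together form the training set.
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        splits.append((train_idx, test_idx))
    return splits


# Hypothetical usage: 20 samples, 5 folds (four folds train, one fold tests).
for train_idx, test_idx in k_fold_indices(n_samples=20, k=5):
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so every instance receives a prediction from a model that did not see it during training.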

Several studies have shown that the accuracy and variety of the base-learners influence the effectiveness of any ensemble of classifiers [

To obtain diversity, the proposed approach uses a heterogeneous base-learner model consisting of six different machine learning algorithms. These algorithms were selected because they represent a wide variety of fields: neural networks, probabilistic models, statistical models, decision trees and ensemble learning. Moreover, prediction models built with these algorithms have performed well in several previous works.

The 5-fold cross-validation procedure was utilized to divide the dataset into five training and testing subsets, which increases the diversity of the input samples of the base learners. The base learners, i.e., KNN, radial SVM, NB, linear SVM, NN, and LightGBM, then take the five training and testing subsets as inputs and output the initial predictions of diabetes. The predictions of the base-learners and their fuzzy clusters are then employed as inputs to the Logistic Regression (LR) algorithm to achieve more generalized performance and reliability. The LR algorithm generates the final diabetes prediction result.

The Artificial Neural Networks (ANN) technique is a well-known machine learning approach for dealing with complicated pattern-oriented problems in both categorization and time-series data types. An ANN algorithm has three major levels: the input layer, the hidden layer(s), and the output layer. By recognizing the intrinsic connections between different features, ANNs attempt to prepare a mapping between the input layer and the output layer. The hidden layer(s) analyses the information obtained from the input layer and then sends it to the output layer [

k-nearest neighbors

k-nearest neighbors (KNN) is a technique that uses a similarity measure, such as a distance function, to assign a new feature vector to a class. After the distances between the new feature vector and all training samples are determined, the new case is allocated to the class with the greatest probability, that is, the class most common among its k nearest training samples.
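As an illustration of the KNN rule described above, the following minimal sketch classifies a query by majority vote among its nearest training samples. The Euclidean distance, the value k = 3, and the toy (glucose, BMI) values are assumptions chosen for the example, not settings or data reported in this paper.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    # Count the class labels of the k closest samples and return the most common.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy samples: (glucose, BMI) with labels 0 = nondiabetic, 1 = diabetic.
train_X = [(85, 22.0), (90, 24.5), (95, 26.0), (150, 33.0), (165, 35.5), (170, 31.0)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (160, 34.0)))
print(knn_predict(train_X, train_y, (88, 23.0)))
```

In practice the features would usually be scaled first, since an unscaled distance is dominated by the feature with the largest numeric range.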

NB is a classifier that is based on the Bayes theorem. It employs the concept of probability and assumes that features are independent of each other [_{i}

The highest posterior of the classification variable is chosen using

Support vector machine (SVM)

Decision hyperplanes are employed in SVM classification. In input space or high-dimensional space, these hyperplanes define the decision boundaries. SVM creates linear functions (hyperplanes) using labeled training samples in order to divide data into two groups (positive or negative). The samples closest to the hyperplanes are known as support vectors. The margin of the SVM is defined as the distance between the support vectors and hyperplanes. SVM aims to increase this margin as much as possible [

LightGBM is a boosting approach for improving a model's performance by merging a group of weak classifiers into a strong classifier. The concept entails selecting weak classifiers in such a way that their performance is considerably enhanced when combined. LightGBM is built on gradient boosting: it starts with decision trees as weak learners and then uses gradient boosting to iteratively fit a sequence of trees. LightGBM is a decision tree-based model that is built leaf-by-leaf instead of depth-by-depth (as in other decision tree-based methods). As a result of this leaf-by-leaf generation, more complex trees are produced with greater accuracy [

Clustering is an unsupervised machine learning approach for identifying natural groupings or patterns in a dataset. The clustering methods ensure that all of the allocated observations in a group or cluster are related. Clustering methods can also aid in the discovery of hidden information in the data [

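The FCM is based on the minimization of the objective function described below (written here in the standard FCM form):

$$J_{m} = \sum_{i=1}^{N} \sum_{j=1}^{K} u_{ij}^{m} \, \lVert x_{i} - C_{j} \rVert^{2}$$

where u_{ij} denotes the probability that person x_{i} is part of the j-th cluster, x_{i} denotes the i-th data point, C_{j} denotes the coordinates of the cluster's center, N is the number of data points, K is the number of clusters, and m > 1 is the fuzzification exponent. The minimization is carried out iteratively, with the memberships u_{ij} and cluster centers C_{j} updated by

$$u_{ij} = \frac{1}{\sum_{k=1}^{K} \left( \frac{\lVert x_{i} - C_{j} \rVert}{\lVert x_{i} - C_{k} \rVert} \right)^{\frac{2}{m-1}}}, \qquad C_{j} = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_{i}}{\sum_{i=1}^{N} u_{ij}^{m}}$$

The probability coefficients reflect that a data item can belong to several clusters. Thus, person x_{i} has a probability of u_{ij} being a member of cluster j considering that

$$\sum_{j=1}^{K} u_{ij} = 1$$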

The FCM algorithm assists in the detection of hidden information. Therefore, integrating base-learner predictions with their clusters can increase the performance of the hybrid meta-classification learner and provide a more accurate diabetes prediction. FCM is employed to obtain the membership values that indicate to which of the two clusters (diabetic and non-diabetic) the base-learner prediction probabilities belong. The similarity of a prediction probability to each class or cluster is represented using membership functions whose values range between 0 and 1. In this paper, the membership function's value of the prediction probability for a class is computed using

Prediction by (Linear SVM) | Prediction by (NB) | Prediction by (NN) | Prediction by (KNN) | Prediction by (Radial SVM) | Prediction by (LightGBM) | Cluster 1 membership | Cluster 0 membership | Final cluster |
---|---|---|---|---|---|---|---|---|
0.0362 | 0.0004 | 0.0000 | 0.0383 | 0.0000 | 0.0304 | 0.0145 | 0.9854 | 0 |
0.0217 | 0.0001 | 0.0001 | 0.0026 | 0.0000 | 0.0507 | 0.0121 | 0.9878 | 0 |
0.1685 | 0.5000 | 0.0207 | 0.0401 | 0.0160 | 0.2354 | 0.8469 | 0.1530 | 1 |
0.0215 | 0.0196 | 0.0020 | 0.0005 | 0.0000 | 0.0417 | 0.0142 | 0.9857 | 0 |
0.2419 | 0.5395 | 0.0086 | 0.1728 | 0.1044 | 0.1727 | 0.9227 | 0.0772 | 1 |
0.0194 | 0.0125 | 0.0051 | 0.0119 | 0.0000 | 0.0481 | 0.0116 | 0.9883 | 0 |
0.1763 | 0.0612 | 0.0040 | 0.1231 | 0.0479 | 0.3956 | 0.3857 | 0.6142 | 0 |
0.0232 | 0.0741 | 0.0021 | 0.0273 | 0.0000 | 0.0686 | 0.0072 | 0.9927 | 0 |
0.1945 | 0.0423 | 0.0090 | 0.1714 | 0.9228 | 0.2371 | 0.5549 | 0.4450 | 1 |
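Membership values such as those in the table above could be produced by a procedure along the following lines. This is a minimal pure-Python sketch of standard Fuzzy C-means, not the authors' code: the fuzzification exponent m = 2, the deterministic initialization, the stopping tolerance, and the toy probability vectors are all assumptions, so its output will not reproduce the table.

```python
import math


def fuzzy_c_means(points, n_clusters=2, m=2.0, n_iter=100, tol=1e-6):
    """Basic Fuzzy C-means; returns (centers, memberships).

    memberships[i][j] is the degree to which points[i] belongs to cluster j,
    and each row sums to 1.
    """
    # Assumed deterministic start: pick initial centers spread across the
    # points ordered by their mean value (lowest mean first).
    ordered = sorted(points, key=lambda p: sum(p) / len(p))
    step = (len(ordered) - 1) / max(n_clusters - 1, 1)
    centers = [list(ordered[round(j * step)]) for j in range(n_clusters)]
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            memberships.append(
                [1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dists) for dj in dists]
            )
        # Center update: mean of the points weighted by u_ij^m.
        new_centers = []
        for j in range(n_clusters):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total
                 for d in range(len(points[0]))]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # stop once the centers have effectively stopped moving
            break
    return centers, memberships


# Hypothetical base-learner probabilities (six learners per row), for illustration only.
toy_probabilities = [
    [0.03, 0.01, 0.00, 0.04, 0.00, 0.03],
    [0.02, 0.00, 0.01, 0.01, 0.00, 0.05],
    [0.05, 0.02, 0.01, 0.03, 0.01, 0.04],
    [0.85, 0.90, 0.80, 0.75, 0.95, 0.88],
    [0.80, 0.95, 0.70, 0.82, 0.90, 0.91],
    [0.90, 0.85, 0.88, 0.79, 0.93, 0.86],
]
centers, memberships = fuzzy_c_means(toy_probabilities)
for row in memberships:
    final_cluster = max(range(len(row)), key=lambda j: row[j])
    print([round(u, 4) for u in row], final_cluster)
```

Because the first center starts from the lowest-mean vector, cluster 0 here corresponds to low predicted probabilities and cluster 1 to high ones; with a random initialization the cluster labels could be swapped.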

The classification method aims to create a model capable of assigning data elements to a specific class based on the current data. It is used to derive important elements from the model or to forecast the data trend. Usually, the dependent variable of the logistic regression algorithm is binary. In other words, the logistic regression method is often employed to solve two-category problems. The primary objective of our experiment is to predict whether an individual has diabetes, which is a classic binary classification problem.

We selected logistic regression for our research from several supervised machine learning techniques. Logistic regression is a well-established technique that produces simple-to-understand models that have been demonstrated to be successful in a variety of situations and cases [

The logistic regression method is built on a linear regression model, which is explained in

The classification problem is similar to the linear regression problem. However, linear regression can only predict continuous values. As the predicted value in the classification problem can only be 0 or 1, we can establish a threshold: if the value exceeds the threshold, 1 is returned; otherwise, 0 is returned. Logistic regression is a type of regression model in which the prediction range is restricted and the prediction score is limited to [0, 1]. Building on linear regression, logistic regression adds a sigmoid function layer (a non-linearity): the attributes are first combined linearly, and the result is then passed through the sigmoid function to produce the prediction. The key formulas of the logistic regression algorithm are shown in

There are two classes in this research: diabetic and nondiabetic. The letter Y denotes whether the person has diabetes. The features in the datasets are represented by the independent variables X. Each independent variable X is associated with the coefficient value

In this research, the prediction probabilities from the base learners (KNN, radial SVM, NB, linear SVM, NN, and LightGBM) and their clusters generated by the Fuzzy C-means algorithm were utilized as inputs to the logistic regression algorithm to obtain the final diabetes prediction result.
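The meta-learner step described above (a linear combination of the inputs passed through a sigmoid and then thresholded) can be sketched as follows. This is an illustrative pure-Python logistic regression trained by batch gradient descent, not the authors' implementation; the learning rate, epoch count, 0.5 threshold, and the toy meta-feature rows (six base-learner probabilities followed by the cluster 1 and cluster 0 memberships) are assumptions.

```python
import math


def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, n_epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(n_epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            # Prediction error for this sample: sigmoid(w . x + b) - y.
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, row))) - target
            grad_b += error
            for d, v in enumerate(row):
                grad_w[d] += error * v
        # Step against the average gradient.
        bias -= lr * grad_b / len(X)
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
    return weights, bias


def predict(weights, bias, row, threshold=0.5):
    """Return 1 (diabetic) if the sigmoid output exceeds the threshold, else 0."""
    return int(sigmoid(bias + sum(w * v for w, v in zip(weights, row))) > threshold)


# Hypothetical meta-features: six base-learner probabilities, then the
# cluster 1 and cluster 0 memberships. Labels: 1 = diabetic, 0 = nondiabetic.
meta_X = [
    [0.03, 0.01, 0.00, 0.04, 0.00, 0.03, 0.01, 0.99],
    [0.02, 0.00, 0.01, 0.01, 0.00, 0.05, 0.02, 0.98],
    [0.85, 0.90, 0.80, 0.75, 0.95, 0.88, 0.98, 0.02],
    [0.80, 0.95, 0.70, 0.82, 0.90, 0.91, 0.97, 0.03],
]
meta_y = [0, 0, 1, 1]
weights, bias = train_logistic_regression(meta_X, meta_y)
print(predict(weights, bias, [0.70, 0.80, 0.60, 0.65, 0.90, 0.75, 0.93, 0.07]))
```

A library implementation would normally replace this loop in practice; the sketch only makes the sigmoid-and-threshold mechanics explicit.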

In our experiments, we evaluated the classification efficiency of the proposed hybrid ensemble scheme and comparison models using five performance assessment measures: accuracy, recall, precision, f1-measure, and AUC. These measures were determined as shown in

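Metrics of performance | Mathematical formula | Notes |
---|---|---|
Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Calculates the percentage of correctly classified instances among all instances. |
Recall | $\frac{TP}{TP + FN}$ | Calculates the proportion of diabetic samples that are correctly classified. |
Precision | $\frac{TP}{TP + FP}$ | Calculates the ratio between the diabetic samples that are correctly classified and the total number of samples predicted as diabetic. |
F-measure | $\frac{2 \times Precision \times Recall}{Precision + Recall}$ | Harmonic mean of precision and recall. |
Area under receiver operating characteristic curve (AUC) | $\int_{0}^{1} TPR \, d(FPR)$ | Determines the diagnostic ability of the scheme to distinguish between persons with diabetes and without diabetes. |

The formulas are given in their standard form: TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, with diabetic as the positive class, and TPR and FPR denote the true-positive and false-positive rates.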
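The five measures can be computed from predictions as in the following minimal sketch. The function names and the toy labels and scores are hypothetical; the AUC is computed with the pairwise-ranking formulation (the probability that a randomly chosen diabetic sample is scored above a randomly chosen nondiabetic one), which is equivalent to the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall, precision, and F-measure, with class 1 as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f_measure": f_measure}


def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, hard predictions, and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
print(classification_metrics(y_true, y_pred))
print(auc_score(y_true, scores))
```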

Our proposed ensemble approach outperformed all the base-learner algorithms with the PIDD dataset, and achieved the highest accuracy value of 99%, followed by the soft voting ensemble method (98%), LightGBM classifier (93.46%), and hard voting ensemble method (90%). The worst accuracies were shown by NN (64.59%) and linear SVM (75.30%). In recall terms, the proposed approach achieved the highest value of 99%, followed by the soft voting ensemble method (97.21%) and LightGBM (82.38%). In precision terms, the suggested approach attained the highest value of 99%, followed by LightGBM (98.97%) and soft voting ensemble (97.51%). The proposed approach also achieved an average f-measure of 98%, outperforming the soft voting ensemble (97.55%) and the LightGBM classifier (89.88%).

Using the SDD, the proposed approach also outperformed all the base-learner algorithms and other ensemble methods. The proposed approach achieved the highest accuracy value of 95.20%, followed by the NB classifier (94.89%), the soft voting ensemble method (94.31%), and the hard voting ensemble method (93.38%). In recall terms, the proposed approach achieved the highest value of 95.40%, followed by the soft voting ensemble method (94.39%) and the linear SVM classifier (93.75%). In precision terms, the suggested approach attained the highest value of 95.10%, followed by the soft voting ensemble (94.18%) and hard voting ensemble (93.41%). The proposed approach also achieved an average f-measure of 95.11%, outperforming the soft voting ensemble (94.01%) and hard voting ensemble (92.02%).

The promising and competitive performance results of the proposed approach demonstrate its superiority over traditional ensemble methods. The capability of the proposed approach to combine fuzzy clustering and logistic regression enables it to appropriately integrate the predictions from base-learners and provide a more accurate prediction of diabetes. As a result, the proposed hybrid approach outperforms the six individual classifiers and two ensemble approaches in terms of classification efficiency, as shown in

Dataset | Approach | Accuracy | Recall | Precision | f1 score | AUC |
---|---|---|---|---|---|---|
Pima Indians Diabetes Database (PIDD) | KNN | 0.8203 | 0.7964 | 0.8780 | 0.8352 | 0.9450 |
PIDD | Radial SVM | 0.7599 | 0.7683 | 0.6634 | 0.7120 | 0.8850 |
PIDD | Naive bayes | 0.7582 | 0.6888 | 0.6048 | 0.6441 | 0.8210 |
PIDD | Linear SVM | 0.7530 | 0.7134 | 0.5951 | 0.6489 | 0.8410 |
PIDD | Neural network | 0.6459 | 0.7743 | 0.7463 | 0.7598 | 0.9090 |
PIDD | LightGBM | 0.9346 | 0.8238 | 0.9897 | 0.8988 | 0.9886 |
PIDD | Soft voting | 0.9800 | 0.9721 | 0.9751 | 0.9755 | 0.9790 |
PIDD | Hard voting | 0.9000 | 0.8752 | 0.9134 | 0.8900 | 0.9194 |
PIDD | Proposed ensemble approach | 0.9931 | 0.9910 | 0.9920 | 0.9814 | 0.9910 |
Schorling Diabetes Dataset (SDD) | KNN | 0.8203 | 0.7964 | 0.5001 | 0.6250 | 0.7370 |
SDD | Radial SVM | 0.9285 | 0.8871 | 0.4021 | 0.5517 | 0.6940 |
SDD | Naive bayes | 0.9489 | 0.8947 | 0.8532 | 0.8717 | 0.8210 |
SDD | Linear SVM | 0.9387 | 0.9375 | 0.7527 | 0.8333 | 0.8690 |
SDD | Neural network | 0.9356 | 0.9165 | 0.6536 | 0.6451 | 0.7190 |
SDD | LightGBM | 0.9046 | 0.8238 | 0.8897 | 0.8988 | 0.9301 |
SDD | Soft voting | 0.9431 | 0.9439 | 0.9418 | 0.9401 | 0.9217 |
SDD | Hard voting | 0.9338 | 0.9317 | 0.9341 | 0.9202 | 0.8631 |
SDD | Proposed ensemble approach | 0.9520 | 0.9540 | 0.9510 | 0.9511 | 0.9410 |

The ROC curve and its AUC were generated for both the individual and ensemble classifiers to demonstrate the separation and discrimination capabilities of the models and to examine their specificity and sensitivity at various class prediction score thresholds. In general, an ROC curve that lies closer to the top-left corner indicates better classification performance.

When combined, precision and recall are valuable metrics for unbalanced data: precision denotes the proportion of returned positive results that are correct, while recall denotes the proportion of actual positive cases that are retrieved. A high recall score indicates a low rate of false negatives, while a high precision score indicates a low rate of false positives. High precision and recall scores suggest that the classifier returns results accurately and retrieves the majority of positive results [

Moreover, we compared the performance of our proposed approach with the results reported by other researchers on the same PIDD dataset to show the level of improvement achieved. The proposed ensemble approach achieved an accuracy of 99%. The results of the other studies are listed in

No | Reference | Method | Accuracy |
---|---|---|---|
1 | [ | Stack of SVM and ANN | 88.04% |
2 | [ | Ensemble classifier based on dominance-based rough set and fuzzy random forests | 77% |
3 | [ | Three-layer ensemble classifier based on a majority voting method | 93% |
4 | [ | Two neural network ensemble classifiers based on cascade-forward back-propagation network (CFBN) and multilayer perceptron (MLP) | 96.88% |
5 | [ | Ensemble classifier based on SVM, naive bayes, and decision tree algorithms | 90.36% |
6 | [ | Method of evolutionary ensemble learning based on stacking | 83.8% |
7 | [ | Competitive co-evolutionary neural networks | 78.2% |
8 | [ | Hybrid system | 80.99% |
9 | [ | Weighted multilayer classifier ensemble approach | 78.21% |
10 | This research | Ensemble learning based on hybrid meta-classifier of fuzzy clustering and logistic regression | 99% |

Diabetes mellitus refers to a group of metabolic disorders that are defined by persistently high blood glucose levels. Diabetes that remains undiagnosed may result in a variety of complications, including retinopathy, nephropathy, neuropathy, and other vascular disorders. This paper proposes a novel ensemble learning method for diabetes prediction that is based on a hybrid meta-classifier composed of fuzzy clustering and logistic regression. The proposed method is divided into two stages. The first stage is a base-learner, which consists of six machine learning algorithms for predicting diabetes. The second stage is a hybrid meta-learner, which incorporates the predictions from the base-learners to provide the final prediction of diabetes. The hybrid meta-learner employs the FCM algorithm to generate highly significant clusters of predictions from base-learners. The predictions of the base-learners and their fuzzy clusters are then employed as inputs to the LR algorithm, which generates the final diabetes prediction result. Experiments were performed using two publicly available diabetes datasets to demonstrate the effectiveness of the proposed method for predicting diabetes. Comparing the proposed approach to other approaches revealed that the proposed approach outperformed the others and achieved the highest prediction accuracies of 99% and 95.20% using the PIDD and SDD datasets, respectively. Future research will focus on effective approaches for weighting base classifiers according to cluster attributes. A limitation of this study is the small number of datasets considered; future work should consider diabetes datasets of various categories and sizes that include all the factors that may contribute to the development of diabetes mellitus.

The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia “for funding this research work through the project number IFPHI-193-830-2020” and King Abdulaziz University, DSR, Jeddah, Saudi Arabia.