This study aims to empirically analyze teaching-learning-based optimization (TLBO) and machine learning algorithms using k-means and fuzzy c-means (FCM) algorithms, evaluating their individual performance in terms of clustering and classification. In the first phase, the clustering algorithms (k-means and FCM) were employed independently and the clustering accuracy was evaluated using different computational measures. During the second phase, the non-clustered data obtained from the first phase were preprocessed with TLBO. TLBO was performed using the k-means (TLBO-KM) and FCM (TLBO-FCM) algorithms (collectively, TLBO-KM/FCM). The objective function was determined by considering both minimization and maximization criteria. The non-clustered data obtained from the first phase were further utilized and fed as input for threshold optimization. Five benchmark datasets were taken from the University of California, Irvine (UCI) Machine Learning Repository for the comparative study and experimentation: the Breast Cancer Wisconsin (BCW), Pima Indians Diabetes, Heart-Statlog, Hepatitis, and Cleveland Heart Disease datasets. The combined average accuracy obtained collectively is approximately 99.4% in the case of TLBO-KM and 98.6% in the case of TLBO-FCM. The approach is also capable of finding the dominating attributes. The findings indicate that TLBO-KM/FCM, considering different computational measures, performs well on the non-clustered data where k-means and FCM, employed independently, fail to provide significant results. Evaluating different feature sets, TLBO-KM/FCM and SVM with grid search (SVM(GS)) clearly outperformed all other classifiers in terms of sensitivity, specificity, and accuracy. TLBO-KM/FCM attained the highest average sensitivity (98.7%), specificity (98.4%), and accuracy (99.4%) for 10-fold cross validation with different test data.

Data mining and machine learning algorithms are efficient in pattern identification, extraction and data separation through clustering and classification [

Pedireddla et al. [

The above review and analysis suggest the need for algorithms with aggregate functionality. They also show that the major concentration has been on preprocessing and feature selection, as symptom variability is higher in medical data.

In this paper, five different benchmark datasets have been considered. These are the BCW dataset (Number of instances: 699, Number of features: 9, Number of classes: 2), Pima Indians Diabetes (Number of instances: 768, Number of features: 8, Number of classes: 2), Heart-Statlog (Number of instances: 270, Number of features: 13, Number of classes: 2), Hepatitis (Number of instances: 155, Number of features: 19, Number of classes: 2) and Cleveland Heart Disease (Number of instances: 296, Number of features: 13, Number of classes: 5). These were taken from the UCI Machine Learning Repository [

The k-means clustering depends on the closest centroid. In the case of a medical dataset, a record can be either malignant or benign. If k-means is applied to such a dataset, the initial centroids sometimes re-adjust themselves and sometimes do not, and this process is repeated several times. The accuracy of the results highly depends on whether this process yields the closest centroid. The FCM algorithm, on the other hand, processes the data by allocating a membership to each data point for each cluster center. The fuzziness value expresses the degree of truth and must be greater than 1, whereas the termination criterion (epsilon) lies between 0 and 1. The process is repeated until the termination criterion is met. This may influence the results, as individual data points may be affected, so there is a chance of becoming trapped in a local optimum. If the values are first arranged by treating this as an optimization problem, the above-mentioned problem can be solved to a great extent: the readjustment is already performed and the final outcome is more organized and normalized. If the k-means or FCM algorithm is then applied to these data, the clustering accuracy can be improved further.
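The FCM iteration described above can be sketched as follows. This is a minimal one-dimensional illustration of the standard FCM update (fuzziness m > 1, epsilon as the termination criterion on membership change), not the exact implementation used in the study:

```python
import random

def fcm(points, n_clusters, m=2.0, eps=1e-4, max_iter=300, seed=0):
    """Minimal fuzzy c-means sketch: m > 1 controls fuzziness,
    eps (0 < eps < 1) is the termination threshold."""
    rng = random.Random(seed)
    n = len(points)
    # random initial membership matrix; each row sums to 1
    u = []
    for _ in range(n):
        row = [rng.random() for _ in range(n_clusters)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = [0.0] * n_clusters
    for _ in range(max_iter):
        # update cluster centers from fuzzified memberships
        for j in range(n_clusters):
            num = sum((u[i][j] ** m) * points[i] for i in range(n))
            den = sum(u[i][j] ** m for i in range(n))
            centers[j] = num / den
        # update memberships from distances to the centers
        new_u = []
        for i in range(n):
            dists = [abs(points[i] - c) + 1e-12 for c in centers]
            row = []
            for j in range(n_clusters):
                inv = sum((dists[j] / dists[k]) ** (2.0 / (m - 1.0))
                          for k in range(n_clusters))
                row.append(1.0 / inv)
            new_u.append(row)
        change = max(abs(new_u[i][j] - u[i][j])
                     for i in range(n) for j in range(n_clusters))
        u = new_u
        if change < eps:  # termination criterion
            break
    return u, centers
```

On two well-separated one-dimensional groups, the centers converge near the group means while each point retains a small membership in the other cluster, which is the local-optimum sensitivity discussed above.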

In order to achieve a good performance, all optimization algorithms require the tuning of their parameters [

The proposed framework provides different functionalities and computational parametric variations, with solutions for variable problem areas. This implies that the required setup for the preprocessing and clustering of data is implemented and evaluated, and the approach can then be applied wherever needed. The functionalities include data selection, preprocessing, partitioning, clustering, classification, and computational parametric variations based on variable parameters. The proposed framework also provides a basic set of application tools that can be extended with different methodological prospects and dataset expansion, with new attributes, for classification and clustering purposes. In phase-I, only the clustering algorithms (k-means and FCM) were used and the clustering accuracy was evaluated using different computational measures. In phase-II, the non-clustered data were treated with TLBO. In phase-III, the non-clustered data obtained from the TLBO process were clustered using the k-means and FCM algorithms. The TLBO-KM and TLBO-FCM (TLBO-KM/FCM) algorithms were used to find the most accurate clusters. The optimized objective function was determined by considering both minimization and maximization. Here, non-clustered refers to the data left unassigned by k-means and FCM after clustering, and the termination criterion refers to the FCM stopping condition for finalizing the clusters. The TLBO-KM/FCM algorithm depicts the complete picture. The terms used in the algorithm are shown in

Symbol | Description
---|---
P | Population size (1, 2, …, n)
A_{1}–A_{n} | Attributes
S | Subjects
D | Difference mean
r_{i} | Random number
T_{f} | Teaching factor
X_{kbest} | Teacher (best learner)
M_{i} | Mean values of the attributes
I_{f} | Interaction combination first
I_{s} | Interaction combination second
X_{i}, Y_{i} | Coordinates
R_{i} | Record number
m | Fuzziness (cluster fuzziness should be greater than 1)
c | Cluster center
x | Data point
d | Euclidean distance

For the experiment, the attributes A_{1}–A_{n} were considered. The objective function is shown in

Range of variables: 1 ≤ A_{i} ≤ n

The first difference mean was computed according to the attributes A_{1}–A_{n}. The updated values were generated after different iterations based on
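The difference-mean update can be sketched with a generic TLBO teacher phase for minimization; this follows the standard TLBO formulation (difference mean r · (X_kbest − T_f · M), teaching factor T_f ∈ {1, 2}) rather than the paper's exact update:

```python
import random

def tlbo_teacher_phase(population, fitness, seed=0):
    """One TLBO teacher phase (minimization): each learner moves toward
    the best solution via the difference mean r * (X_kbest - T_f * M)."""
    rng = random.Random(seed)
    n_vars = len(population[0])
    teacher = min(population, key=fitness)            # X_kbest (best learner)
    mean = [sum(x[d] for x in population) / len(population)
            for d in range(n_vars)]                   # M_i per attribute
    new_pop = []
    for x in population:
        t_f = rng.choice([1, 2])                      # teaching factor T_f
        r = rng.random()                              # random number r_i
        diff = [r * (teacher[d] - t_f * mean[d]) for d in range(n_vars)]
        candidate = [x[d] + diff[d] for d in range(n_vars)]
        # greedy selection: keep the better of the old and new solutions
        new_pop.append(candidate if fitness(candidate) < fitness(x) else x)
    return new_pop
```

Because of the greedy selection step, the best fitness in the population never worsens from one phase to the next.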

For the comparative study and analysis, different classification algorithms, along with our approach, were considered for the experimentation. The classification algorithms used are random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), SVM with grid search (SVM(GS)), and naive Bayes (NB). To avoid any ambiguous inference, each experiment was repeated for 50 cycles for the calculation of average accuracy.

Five different benchmark datasets have been considered for experimentation. These are BCW dataset (D1), Pima Indians Diabetes (D2), Heart-Statlog (D3), Hepatitis (D4) and Cleveland Heart Disease (D5).

This section discusses the outcomes of TLBO-KM/FCM and the machine learning algorithms in different cases. First, the TLBO-KM/FCM results were considered for different cases of the D1 dataset. For the comparison of the results, the positive predictive value (PPV) was considered first (

In the case of k-means, foggy and random centroids were used for initialization. The Euclidean distance is used to find the distance between the cluster center and the data points. The simple and variance split methods were applied for data splitting, and the cluster centers were calculated based on the mean and variance.
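The centroid re-adjustment loop amounts to a plain Lloyd-style k-means iteration; a minimal sketch follows. The study's foggy-centroid initialization and variance-split method are specific to this work and are not reproduced here:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, centers, max_iter=50):
    """Lloyd-style k-means: assign each point to its closest centroid
    (Euclidean distance), then recompute each centroid as the mean of
    its cluster, until the assignments stop changing.
    Note: `centers` is updated in place."""
    assign = [-1] * len(points)
    for _ in range(max_iter):
        new_assign = [min(range(len(centers)),
                          key=lambda j: euclidean(p, centers[j]))
                      for p in points]
        if new_assign == assign:       # centroids stopped re-adjusting
            break
        assign = new_assign
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                # mean-based centroid update
                centers[j] = [sum(c) / len(members)
                              for c in zip(*members)]
    return assign, centers
```

The quality of the final clustering depends entirely on the initial `centers`, which is the sensitivity to initialization discussed above.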

S. No. | Cases | Parameters
---|---|---
1 | Case 1 | TLBO design variables and foggy centroid
2 | Case 2 | TLBO design variables and foggy centroid with complete population
3 | Case 3 | TLBO design variables and random centroid
4 | Case 4 | Variations in different epochs
5 | Case 5 | Variations in variance and same centroid
6 | Case 6 | Variations in TLBO knowledge transfer (interaction cycle)

For Case 1, the results were obtained using foggy centroid, Euclidean distance, simple-split method, epoch, and variations in the design variables with 10-fold cross validation in a complete cycle. The simple-split method is used to cluster more elements. The epoch determines the stopping condition of the iteration in the process of identifying the cluster center.

The parameters remain the same for Cases 2–5; however, the whole population was considered here. For Cases 2, 4, and 5, k-means produced highest and lowest clustering accuracies of (91.0%, 86.0%), (92.0%, 89.0%), and (94.0%, 90.0%), with average accuracies of (89.6%, 85.4%), (90.6%, 88.3%), and (91.4%, 89.7%), respectively. The non-clustered records were then processed with TLBO-KM. The highest and lowest minimization clustering accuracies are (98.0%, 92.0%), (100%, 97.0%), and (99.0%, 94.0%), while those of the maximization are (97.0%, 94.0%), (98.0%, 92.0%), and (95.0%, 92.0%) for Cases 2, 4, and 5, respectively. The average clustering accuracies for minimization and maximization are (95.6%, 91.4%), (98.8%, 96.4%), and (98.8%, 92.7%), respectively. For Case 3, as the initialization remained the same in all iterations, no variation was found in the case of k-means. Although the results may vary with TLBO-KM, the variations are caused by the random initialization only; therefore, the specific results of Case 3 are not presented. These results are shown in

For Case 6, the same parameters were used with a completely random selection of attributes, with variation in TLBO knowledge transfer (interaction cycle).

Thereafter, the FCM algorithm was applied. The experimentation was performed on the basis of variations in the fuzziness value and the termination criterion. In our approach, fuzziness values from 2 to 5 were considered and the epsilon value lies between 2 × 10^{–5} and 6 × 10^{–5}.

S. No. | FCM accuracy | Epsilon factor | Fuzziness factor
---|---|---|---
1 | 0.93 | E-1 | 2
2 | 0.97 | E-1 | 3
3 | 0.97 | E-1 | 4
4 | 0.93 | E-1 | 5
5 | 0.96 | E-2 | 2
6 | 0.92 | E-2 | 3
7 | 0.92 | E-2 | 4
8 | 0.97 | E-2 | 5
9 | 0.96 | E-3 | 2
10 | 0.91 | E-3 | 3
11 | 0.96 | E-3 | 4
12 | 0.96 | E-3 | 5
13 | 0.92 | E-4 | 2
14 | 0.94 | E-4 | 3
15 | 0.96 | E-4 | 4
16 | 0.95 | E-4 | 5
17 | 0.95 | E-5 | 2
18 | 0.96 | E-5 | 3
19 | 0.92 | E-5 | 4
20 | 0.96 | E-5 | 5

Note: E-1 = 2 × 10^{–5}, E-2 = 3 × 10^{–5}, E-3 = 4 × 10^{–5}, E-4 = 5 × 10^{–5}, E-5 = 6 × 10^{–5}.

Mean, standard deviation (SD), and the standard error of the mean (SEM) were considered to assess variability over the complete population. The mean is the sum of the weighted instances divided by the total number of instances. SD and SEM were used for the presentation of the data characteristics: SD shows the dispersion of the individual values, while SEM was used for statistical inference. The variance was also examined to check the suitability of the objective function. The mean (

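These three statistics can be computed directly; a minimal sketch, using the sample standard deviation and SEM = SD/√n:

```python
import math

def mean_sd_sem(values):
    """Mean, sample standard deviation, and standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    sd = math.sqrt(var)            # dispersion of individual values
    sem = sd / math.sqrt(n)        # precision of the mean estimate
    return mean, sd, sem
```

SD describes the spread of individual observations, while SEM shrinks with the sample size and so describes how precisely the mean itself is estimated.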

Accuracy: the proportion of predicted outcomes that are correct out of all outcomes. It is shown in

Sensitivity: the proportion of actual positive outcomes that are predicted positive. It is shown in

Specificity: the proportion of actual negative outcomes that are predicted negative. It is shown in
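From a binary confusion matrix (true/false positives and negatives), these measures reduce to the usual ratios; a minimal sketch:

```python
def clf_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / all outcomes
    sensitivity = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    return accuracy, sensitivity, specificity
```

For example, with 90 true positives, 85 true negatives, 5 false positives, and 10 false negatives, the sensitivity is 90/100 = 0.90 and the specificity is 85/90 ≈ 0.94.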


In this study, k-means and FCM were applied first, and TLBO-KM/FCM was then applied to the non-clustered data.

The TLBO was used for the data preprocessing, and TLBO-KM/FCM outperforms the standalone algorithms in all cases.
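The non-clustered data, i.e., the records left over by the first-phase clustering, can be read as a distance-threshold split; the following is a minimal sketch under that assumption (the paper does not define the split precisely, and the threshold itself would be a natural target for the threshold optimization mentioned earlier):

```python
def split_non_clustered(points, centers, threshold):
    """Flag points farther than `threshold` from every cluster center
    as non-clustered (candidates for TLBO preprocessing)."""
    clustered, non_clustered = [], []
    for p in points:
        nearest = min(abs(p - c) for c in centers)
        (clustered if nearest <= threshold else non_clustered).append(p)
    return clustered, non_clustered
```

The clustered portion is scored directly, while the non-clustered portion is handed to the second (TLBO) phase for re-clustering.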

In Case 1 (BCW dataset), at first, only k-means was applied, using the parameters described above.

In Cases 2–5 (BCW dataset), instead of selecting randomly, the whole population was considered. Case 2 includes the variations in TLBO design variables and foggy centroid, and Case 3 additionally includes the variations in random centroid. Case 4 includes the variations in different epochs. Case 5 includes the variations in the variance and same centroid. The clustering accuracies obtained by k-means and TLBO-KM for these cases are reported above.

In Case 3 (BCW dataset), no variation was detected as the initialization remains the same in all iterations. The results may vary with TLBO. However, the variation caused by the random initialization is already covered in other cases.

In Case 6 (BCW dataset), the whole population with a completely random selection of attributes, with the variations in TLBO knowledge transfer (interaction cycle), was considered. The clustering accuracy obtained is approximately 91% in the case of k-means. The TLBO-KM applied to the non-clustered data achieves an average clustering accuracy of approximately 99% and 98% for the minimization and maximization, respectively. The results show that TLBO-KM performs better than k-means alone.

The clustering accuracies obtained with FCM were approximately 95%. The TLBO-FCM with different epsilon values and fuzziness factors achieves average clustering accuracies of approximately 97% and 98%, respectively. The results show that TLBO-FCM performs better in comparison to FCM alone.

Therefore, TLBO-KM/FCM is efficient when compared to the k-means and FCM algorithms applied independently.

The combined average accuracy obtained collectively is approximately 99.4% in case of TLBO-KM and 98.6% in case of TLBO-FCM.

Evaluating different feature sets, the TLBO-KM/FCM and SVM(GS) clearly outperformed all other classifiers in terms of sensitivity, specificity and accuracy. TLBO-KM/FCM attained the highest average sensitivity (98.7%), highest average specificity (98.4%) and highest average accuracy (99.4%) for 10-fold cross validation with different test data.

The experimental framework was developed in the NetBeans 7.2 IDE (Apache Software Foundation, Wakefield, USA) with Java Development Kit (JDK, Oracle Corporation, California, USA) version 1.7, using an Intel® Core™ i5-7200U CPU running at 2.8 GHz with 4 GB RAM on a 64-bit operating system with an x64-based processor. This experiment can be replicated and enhanced in the future by changing the centroid calculation and validating different distance measures. Different combinations of data mining, classification, and evolutionary algorithms may be used, but how these algorithms can be used together, and which techniques will be more effective in combined form, are points that warrant future research. This work can also be extended to datasets with different cardinalities and attributes.

In this study, TLBO-KM/FCM and machine learning algorithms were used for the clustering and classification of medical datasets. In order to compare their efficiency, they were applied separately to the same dataset. Various computational measures of integrative clustering were taken into account using multivariate parameters such as foggy centroid, random centroid, epoch variations, design variables, fuzziness value, termination criterion, and interaction cycle. For the explanation and discussion, the BCW dataset was considered first. The TLBO-KM was able to cluster 99.4% and 97.4% of the non-clustered data (produced by applying k-means alone) in the case of minimization and maximization, respectively. Similarly, TLBO-FCM was able to cluster 98.6% and 96.4% of the non-clustered data (produced by applying FCM alone) in the case of minimization and maximization, respectively. The combined average accuracy obtained collectively is approximately 99.4% in the case of TLBO-KM and 98.6% in the case of TLBO-FCM. Moreover, the variations between the minimization and maximization results were small; thus, it can be inferred that our approach produces good results for either the minimization or the maximization of the objective function, with the minimization cases producing slightly better results. This approach is also useful for determining the dominating attributes. The TLBO-KM/FCM and SVM(GS) clearly outperformed all other classifiers in terms of sensitivity, specificity, and accuracy, with TLBO-KM/FCM showing the highest average sensitivity (98.7%), specificity (98.4%), and accuracy (99.4%) for the 10-fold cross validation. The present study suggests that TLBO-KM/FCM with different computational measures and multivariate parameters, in different iterations and multiple TLBO preprocessing cycles, can efficiently handle medical data.