Breast cancer seriously affects many women. If breast cancer is detected at an early stage, it may be cured. This paper proposes a novel classification model based on improved machine learning algorithms for diagnosing breast cancer at an early stage. The model combines feature selection and Bayesian optimization to build improved machine learning models. Support Vector Machine, K-Nearest Neighbor, Naive Bayes, Ensemble Learning and Decision Tree were used as the machine learning algorithms. All experiments were run on two datasets: the Wisconsin Breast Cancer Dataset (WBCD) and the Mammographic Breast Cancer Dataset (MBCD). A series of experiments was carried out to find the best-performing classification pipeline. Relief, Least Absolute Shrinkage and Selection Operator (LASSO) and Sequential Forward Selection were each used to determine the most relevant features. The machine learning models were then tuned with Bayesian optimization to obtain optimal hyperparameter values. The results showed that the unified feature selection and hyperparameter optimization method improved the classification performance of all machine learning algorithms. Among the experiments, LASSO-BO-SVM achieved the highest accuracy, precision, recall and F1-score on both datasets (97.95%, 98.28%, 98.28%, 98.28% for MBCD and 98.95%, 97.17%, 100%, 98.56% for WBCD), outperforming recent studies.

Breast cancer (BC) has been the most commonly diagnosed malignant disease among females in recent years.

The main contributions of this article are summarized below:

Decision Tree (DT), Naive Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbor (K-NN) and Ensemble Learners (EL) were used to classify malignant and benign breast lesions.

Three feature selection methods, namely Relief (RF), LASSO and Sequential Forward Selection (SFS), were used to determine the most discriminative features for effective identification of BC.

The Bayesian optimization (BO) algorithm was used to optimize the classification algorithms.

Accuracy, precision, recall and F1-score were used to measure the performance of the proposed classification model, which was implemented in MATLAB.

A series of comparative analyses was performed on the Wisconsin Breast Cancer Dataset (WBCD), retrieved from the UCI machine learning repository, and on the Mammographic Breast Cancer Dataset (MBCD), which has not been used before.

The rest of the study is organized as follows: the related literature is summarized in Section 2, the materials and methods are described in Section 3, the results are presented in Section 4 and discussed in Section 5, and Section 6 presents the conclusion.

Numerous studies on the early diagnosis of BC have been conducted in recent years, and various ML algorithms are used for this purpose.

In the literature, many feature selection (FS) methods are used to select optimal features. FS techniques fall into three categories: filter, wrapper and embedded.

Machine learning algorithms have many hyperparameters, and these need to be set, ideally automatically, to optimize performance. Hyperparameter optimization (HO) is an approach that chooses a set of optimal hyperparameters for a machine learning algorithm. In the literature, grid search, random search, and Bayesian optimization have been used to set hyperparameters automatically.

The suggested classification model is illustrated for prediction of BC in

The working principle of the proposed model is shown in Algorithm 1.

The first dataset is the WBCD, which consists of 569 instances with 32 attributes (ID number, diagnosis and 30 input features). The features were extracted from images of cell nuclei. Each instance is labeled as benign or malignant; there are 356 benign and 213 malignant instances.

No | Features | No | Features |
---|---|---|---|
1 | Radius mean | 16 | Compactness standard error |
2 | Texture mean | 17 | Concavity standard error |
3 | Perimeter mean | 18 | Concave points standard error |
4 | Area mean | 19 | Symmetry standard error |
5 | Smoothness mean | 20 | Fractal dimension standard error |
6 | Compactness mean | 21 | Radius worst |
7 | Concavity mean | 22 | Texture worst |
8 | Concave points mean | 23 | Perimeter worst |
9 | Symmetry mean | 24 | Area worst |
10 | Fractal dimension mean | 25 | Smoothness worst |
11 | Radius standard error | 26 | Compactness worst |
12 | Texture standard error | 27 | Concavity worst |
13 | Perimeter standard error | 28 | Concave points worst |
14 | Area standard error | 29 | Symmetry worst |
15 | Smoothness standard error | 30 | Fractal dimension worst |

The second dataset is the MBCD, which includes a total of 195 breast tumors (116 malignant images (59%) and 79 benign images (41%)). The dataset was collected retrospectively at Ankara Training and Research Hospital, and the study was approved by the Institutional Ethics Committee of Ankara Training and Research Hospital (319/E-20). All patients who underwent digital mammography between April 2015 and April 2020 were retrieved from the Picture Archiving and Communication System (PACS). All mammograms were acquired with an IMS Giotto system (Bologna, Italy). Patient consent was obtained on the condition that all data were anonymized. The mammogram images were segmented to determine the region of interest (ROI), which represents the breast tumor. The process of extraction of ROI is shown in

A total of 54 shape and texture features were calculated for each ROI. Intensity, Gray-Level Co-occurrence Matrix (GLCM) and Gray-Level Run-Length Matrix (GLRM) methods were used to generate the texture features. In total, 16 shape features (F1–F16), 15 intensity-based features (F17–F31), 12 GLCM features (F32–F43) and 11 GLRM features (F44–F54) were calculated.

No | Features | No | Features | No | Features |
---|---|---|---|---|---|
1 | Area | 19 | Variance | 37 | Sum of mean |
2 | Perimeter | 20 | Smoothness | 38 | Sum of variance |
3 | Max. Radius | 21 | Skewness | 39 | Sum of entropy |
4 | Min. Radius | 22 | Kurtosis | 40 | Difference variance |
5 | Euler Number | 23 | MAD | 41 | Difference entropy |
6 | Eccentricity | 24 | Minimum | 42 | Information measure of correlation 1 |
7 | Solidity | 25 | Maximum | 43 | Information measure of correlation 2 |
8 | Entropy | 26 | 10th Percentile | 44 | SRE |
9 | Equiv. Diameter | 27 | 90th Percentile | 45 | LRE |
10 | Elongatedness | 28 | IQR | 46 | GLNU |
11 | Circulation 1 | 29 | Range | 47 | RLN |
12 | Circulation 2 | 30 | RMS | 48 | RP |
13 | Compactness | 31 | Median | 49 | LGRE |
14 | Dispersion | 32 | Contrast | 50 | HGRE |
15 | Thinness ratio | 33 | Correlation | 51 | SRLGE |
16 | Shape index | 34 | Energy | 52 | SRHGE |
17 | Mean | 35 | Homogeneity | 53 | LRLGE |
18 | Std. Deviation | 36 | Sum of Square | 54 | LRHGE |

Data normalization is a preprocessing technique that rescales the numeric values in a dataset to a common scale. In this study, the z-score normalization method was used. The z-score of a value is the number of standard deviations by which it lies away from the mean.
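The z-score step can be illustrated with a minimal sketch. It assumes the sample standard deviation (division by n − 1), which the text does not specify, and the feature values are hypothetical rather than taken from either dataset.

```python
import statistics

def zscore(values):
    """Return (x - mean) / std for every x in one feature column."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std (n - 1): an assumption
    if std == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]

# Hypothetical values for a single feature column
radius_mean = [12.0, 14.5, 20.1, 11.3, 17.8]
scaled = zscore(radius_mean)
```

After scaling, the column has mean 0 and standard deviation 1, so features measured in different units contribute on a comparable scale.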

FS is used to eliminate redundant and irrelevant features. Removing irrelevant features improves classification performance and reduces the computational cost of modeling.
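As an illustration of the wrapper idea behind Sequential Forward Selection, the sketch below greedily adds the feature that most improves a score and stops when no candidate helps. The function name, the stopping rule and the toy score (a stand-in for cross-validated accuracy) are assumptions for illustration, not the authors' MATLAB implementation.

```python
def sequential_forward_selection(n_features, score_fn, max_features=None):
    """Greedy wrapper selection: add the feature that most improves score_fn."""
    selected = []
    best_score = float("-inf")
    remaining = list(range(n_features))
    limit = max_features if max_features is not None else n_features
    while remaining and len(selected) < limit:
        # Score every subset formed by adding one more candidate feature
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        cand_score, cand = max(scored)
        if cand_score <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected, best_score

# Hypothetical score: features 0 and 3 are informative, and every
# selected feature costs a small penalty.
def toy_score(subset):
    useful = {0: 0.5, 3: 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, score = sequential_forward_selection(5, toy_score)
print(selected)  # [0, 3]
```

In practice `score_fn` would train and cross-validate a classifier on the candidate subset, which is what makes wrapper methods more expensive than filter methods such as Relief.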

Cross validation is a resampling method that repeatedly divides the data into two groups, training and testing. 10-fold cross validation was used to evaluate the models: in each fold, 90% of the data were used for training and the remaining 10% for testing.

In this study, DT, NB, SVM, K-NN and EL algorithms were used for classification. In the DT approach, simple decision rules extracted from the data are used to estimate the value of the target. DT is generally used for classification and regression.

Hyperparameters have a very important effect on the performance of ML algorithms because they directly affect the training process. For example, the box constraint, kernel parameter and kernel scale are very important for SVM, and the maximum number of splits affects the performance of a decision tree. These hyperparameters must be set well to obtain good results. HO automates the selection of hyperparameter values.

In this study, BO was used to select the hyperparameters of the machine learning algorithms automatically. BO is an effective black-box optimization technique for parameter search. It builds a probabilistic model by placing a prior probability distribution over the function being optimized, and then combines this prior with sampled evaluations to obtain a posterior distribution over the function.
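The 10-fold split can be sketched as follows. This is a plain (non-stratified) shuffle-and-split with a hypothetical seed; whether the original experiments used stratified folds is not stated, so treat this only as an illustration of the 90%/10% partitioning.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# 569 is the number of WBCD instances given in the text
splits = list(k_fold_indices(569, k=10))
```

With 569 instances, each test fold holds 56 or 57 samples and the matching training set holds the rest; the reported metric is then averaged or pooled over the ten folds.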
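The prior-to-posterior loop described above can be sketched in one dimension. This is a minimal illustration, not the MATLAB routine used in the study: the Gaussian-process surrogate with a squared-exponential kernel, the expected-improvement acquisition function, the candidate grid and the toy objective (a stand-in for cross-validated error as a function of the log10 box constraint) are all assumptions.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def gp_posterior(xs, ys, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    y_mean = sum(ys) / len(ys)
    centered = [y - y_mean for y in ys]  # the GP prior has zero mean
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = y_mean + sum(k * w for k, w in zip(k_star, solve(K, centered)))
    var = 1.0 - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """Expected improvement over the best value so far (minimization)."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf

def bayes_opt(objective, candidates, init_points, n_iter=8):
    """Minimize objective over candidates, starting from init_points."""
    xs = list(init_points)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        best = min(ys)

        def acquisition(x):
            mean, std = gp_posterior(xs, ys, x)
            return expected_improvement(mean, std, best)

        # Evaluate the objective where the surrogate promises the most gain
        x_next = max(candidates, key=acquisition)
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = min(range(len(ys)), key=lambda i: ys[i])
    return xs[i_best], ys[i_best]

# Hypothetical objective, not taken from the paper
def toy_cv_error(log_c):
    return 0.03 + 0.01 * (log_c - 1.0) ** 2

grid = [-3.0 + 0.25 * i for i in range(25)]  # log10 of the range [0.001, 1000]
best_log_c, best_err = bayes_opt(toy_cv_error, grid, init_points=[-3.0, 0.0, 3.0])
```

Each iteration refits the surrogate to all evaluations so far, which is the "combine the prior with sample information" step; in a real run the objective would be the cross-validated error of the classifier trained with the candidate hyperparameters.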

Algorithm | Hyperparameters | Search range |
---|---|---|
DT | ‘Maximum number of splits’ | [1–568] |
DT | ‘Split Criterion’ | Gini's diversity index, Maximum deviance reduction |
NB | ‘Distribution names’ | Gaussian, Kernel |
NB | ‘Kernel Type’ | Gaussian, Box, Epanechnikov, Triangle |
SVM | ‘Kernel Function’ | Gaussian, Linear, Quadratic, Cubic |
SVM | ‘Kernel Scale’ | [0.001–1000] |
SVM | ‘Box Constraint Level’ | [0.001–1000] |
SVM | ‘Standardize data’ | True/False |
EL | ‘Ensemble Method’ | Bag, GentleBoost, LogitBoost, AdaBoost, RUSBoost |
EL | ‘Number of Learners’ | [10–500] |
EL | ‘Learning Rate’ | [0.001–1] |
EL | ‘Maximum number of splits’ | [1–568] |
K-NN | ‘Number of Neighbors’ | [1–285] |
K-NN | ‘Distance Metric’ | City Block, Chebyshev, Correlation, Euclidean, Hamming, Jaccard, Mahalanobis, Minkowski, Spearman |

A confusion matrix is used to visualize the performance of ML models. It is built from the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts. TP means that malignant cases are correctly recognized as malignant. TN means that benign cases are correctly identified as benign. FP means that benign cases are mistakenly identified as malignant. FN means that malignant cases are mistakenly identified as benign. Accuracy is the proportion of correctly identified cases among all cases. Precision is the number of true positives divided by the number of all predicted positives. Recall is the percentage of true positives among the actual positive cases. F1-score is the harmonic mean of precision and recall.
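These four measures follow directly from the confusion-matrix counts, as the short sketch below shows. The counts in the example are an assumption: they are not quoted from the paper's confusion matrix, but were chosen to be consistent with the MBCD class sizes (116 malignant, 79 benign) and with the LASSO-BO-SVM rates reported in the abstract.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1-score from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # correct among predicted positives
    recall = tp / (tp + fn)      # correct among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Assumed counts: malignant is the positive class, 195 cases in total
metrics = classification_metrics(tp=114, tn=77, fp=2, fn=2)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
print(percentages)
# {'accuracy': 97.95, 'precision': 98.28, 'recall': 98.28, 'f1': 98.28}
```

Because FP and FN are equal in this example, precision, recall and F1-score coincide.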

Different experiments were run with the machine learning methods to achieve the best classification rates on the BC datasets. The experiments were first carried out with the built-in functions of MATLAB 2020a. Then, by using all the features, MATLAB Statistics and Machine Learning Toolbox program [

Dataset | Method | Number | Selected features |
---|---|---|---|
WBCD | RF | 16 | 2, 4, 6, 7, 10, 12, 14, 19, 23, 27, 33, 40, 43, 46, 48, 49, 51 |
WBCD | LASSO | 8 | 2, 7, 16, 17, 19, 31, 35, 47 |
WBCD | SFS | 3 | 7, 32, 48 |
MBCD | RF | 12 | 1, 2, 7, 11, 13, 14, 17, 19, 21, 22, 23, 29 |
MBCD | LASSO | 10 | 1, 2, 8, 11, 15, 18, 21, 22, 25, 30 |
MBCD | SFS | 4 | 1, 21, 22, 23 |

The accuracy-based classification results (%) for WBCD and MBCD are shown in the two tables below, respectively.

Method | DT | NB | SVM | K-NN | EL |
---|---|---|---|---|---|
ALL | 92.1 | 93.5 | 95.96 | 95.43 | 95.08 |
BO | 92.8 | 94.38 | 96.84 | 95.78 | 95.61 |
RF-BO | 94.03 | 96.66 | 98.77 | 96.48 | 96.30 |
LASSO-BO | 95.43 | 95.43 | 98.95 | 97.19 | 98.24 |
SFS-BO | 94.55 | 95.08 | 97.19 | 98.06 | 97.01 |

Method | DT | NB | SVM | K-NN | EL |
---|---|---|---|---|---|
ALL | 91.28 | 90.26 | 90.77 | 85.13 | 89.23 |
BO | 92.31 | 91.28 | 92.82 | 90.26 | 91.79 |
RF-BO | 95.38 | 93.85 | 95.9 | 94.36 | 95.41 |
LASSO-BO | 94.36 | 94.36 | 97.95 | 95.38 | 97.43 |
SFS-BO | 93.85 | 95.9 | 96.41 | 96.92 | 95.9 |

As observed in

For efficient early diagnosis of BC, a classification model based on improved machine learning algorithms is presented in this study. The proposed classification model was tested on two different BC datasets. Initially, all features were given directly as input to the machine learning algorithms. Then, the machine learning methods were optimized with Bayesian optimization to improve classification performance, again with all features as input. Finally, RF, LASSO and SFS were each combined with Bayesian optimization to further improve the performance of the ML models. In

The best classification rates for WBCD are shown in the table below.

Methods | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
LASSO-BO-DT | 95.43 | 94.33 | 93.46 | 93.9 |
RF-BO-NB | 96.66 | 97.17 | 94.06 | 95.59 |
LASSO-BO-SVM | 98.95 | 97.17 | 100 | 98.56 |
SFS-BO-K-NN | 98.06 | 95.28 | 99 | 97.35 |
LASSO-BO-EL | 98.24 | 98.11 | 97.08 | 97.65 |

The best classification rates for MBCD are shown in the table below.

Methods | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
RF-BO-DT | 95.38 | 96.55 | 95.72 | 96.14 |
SFS-BO-NB | 95.9 | 95.69 | 97.37 | 96.52 |
LASSO-BO-SVM | 97.95 | 98.28 | 98.28 | 98.28 |
SFS-BO-K-NN | 96.92 | 98.28 | 96.61 | 97.44 |
LASSO-BO-EL | 97.43 | 98.28 | 98.24 | 97.85 |

The confusion matrices of the LASSO-BO-SVM method for WBCD and MBCD are shown in

The proposed method (LASSO-BO-SVM) is compared with six recent studies that used WBCD in the table below.

Reference | Methods | Dataset | Accuracy |
---|---|---|---|
Mate et al. | BO-ETC | WBCD | 96.52% |
Kumar et al. | BO-RF Classifier | WBCD | 96.14% |
Thawkar et al. | BOA-ALO-ANN | WBCD | 98.16% |
Asri et al. | SVM | WBCD | 97.13% |
Bensaoucha et al. | BO-SVM | WBCD | 96.52% |
Khandezamin et al. | LR-GMDH | WBCD | 97.9% |

Note: LR: Logistic Regression, GMDH: Group Method of Data Handling, RF: Random Forest, BOA: Butterfly Optimization Algorithm, ALO: Ant Lion Optimizer, ANN: Artificial Neural Network, ETC: Extra Tree Classifier.

Recent years have witnessed many studies on the diagnosis of BC at an early stage. Although much effort has been directed to this field, it is still very challenging for researchers to choose the right method for an effective diagnostic model. This study proposes a novel classification model based on improved ML algorithms that combine the RF, LASSO and SFS feature selection methods with Bayesian optimization for efficient diagnosis of BC. Among the many variations, the LASSO-BO-SVM method achieved the highest accuracy, recall, precision and F1-score on both BC datasets. With these high classification rates, the LASSO-BO-SVM technique has the potential to help radiologists make more accurate BC diagnosis decisions. In future work, we will use deep learning models for early diagnosis of BC and compare machine learning algorithms with deep learning models.

The authors would like to acknowledge the Department of Radiology of Ankara Training and Research Hospital, Ankara for their support in this work.