COVID-19 Infected Lung Computed Tomography Segmentation and Supervised Classification Approach

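The purpose of this research is the segmentation of lung computed tomography (CT) scans for the diagnosis of COVID-19 using machine learning methods. Our dataset contains data from patients who are prone to the epidemic, with lung CT images divided into three classes (Normal, Pneumonia, and COVID-19). A complementary part of the study is devoted to the use of a new statistical model to fit the main COVID-19 datasets collected in Pakistan.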


Introduction
An epidemic alarmed the world when a form of pneumonia began to spread from one human to another. The severe respiratory syndrome is caused by a coronavirus, a new member of an already-known family of viruses (positive-sense single-stranded RNA viruses, +ssRNA) mostly found in animals [1]. It is a curable disease, but it can also be life-threatening, with a 3% death rate and a 7.5% reproductive rate. Acute illness can cause death due to massive lung damage and difficulty breathing. The spread of the virus started in Wuhan, the capital of China's Hubei province, and the virus is recognized as related to two earlier coronavirus diseases: the "Middle East respiratory syndrome" (MERS) and the "severe acute respiratory syndrome" (SARS) [2]. On 11 February 2020, the World Health Organization (WHO) specified that the virus is new and named the disease coronavirus disease 2019 (COVID-19). COVID-19 has become the greatest challenge for the survival of mankind because of its exponential growth and the non-availability of vaccines or any confirmed medication [3]. As of 9 January 2021, 89,318,701 confirmed cases and 1,920,711 deaths had been reported across the globe. Globally, the mortality of the disease estimated by WHO is 3.4%, but it varies from region to region depending on several factors such as climate, travel history, and sociability [4]. These data are based on confirmed reported cases. They are certainly underestimated, because several reports indicate a low percentage of reporting in their respective territories. One of the main reasons highlighted in the reports is the small number of diagnostic tests. Earlier, the disease was diagnosed from clinical symptoms (fever, cough, flu, etc.), travel history, and epidemiological history. A positive diagnosis can be confirmed by computed tomography (CT) images or a positive pathogen test, which matters because an infected person may not show any of these symptoms [5].
Although pathogen testing based on real-time RT-PCR is considered a scientific tool for disease diagnosis, the quality, stability, and reproducibility of the method are still in question. The questionable quality of the kits and the delay in test results are pushing scientists to look for new diagnostic tools that produce rapid results and are at least as effective as the PCR test. Several alternative diagnostic tools based on artificial intelligence and machine learning have been proposed [6]. Early diagnosis of the disease and timely transfer of the patient to quarantine (a specialized hospital) have proved beneficial in different countries. The process of diagnosing this disease is relatively fast, but the upfront cost of diagnostic tests can be a disaster for the patient and for the state, especially in countries that lack a well-functioning health system because of poverty [7].
In this study, we use the Deep Learning J4 (DLJ4) classifier, which is based on deep learning (DL). DL is largely responsible for the current growth in the use of artificial intelligence (AI). DL is a combination of machine learning techniques and AI, and it has played an important role in medical image classification tasks since its creation. DL techniques are useful for mining, analyzing, and recognizing patterns, especially in medical data, and thereby support clinical decision making [8]. Technically, DL is a class of algorithms that is scalable: thanks to the availability of high-performance computers, its performance keeps improving as more data are fed to it. More precisely, DL classifiers operate through multiple layers of artificial neural networks (ANNs), each layer passing a simple representation of the data to the next layer. In contrast, most machine learning (ML) classifiers perform well on small datasets (with a hundred columns, for instance). An unstructured digital image dataset yields such a large feature vector space (FVS) that the process becomes unusable [9]. For instance, a digital image of size 800 × 1000 with three color channels has 800 × 1000 × 3 = 2.4 million FVS values, which is too many for ML classifiers to handle. DL classifiers gradually learn more about the digital image as it passes through each ANN layer. The early layers learn to detect low-level features, such as edges, and the subsequent layers combine the features from the earlier layers into a more comprehensive representation [10].
In this research, we propose a novel segmentation framework, called fuzzy c-mean automated region-growing segmentation (FARGS), for the diagnosis of COVID-19 using CT scans. Our methodology is based on the following elements:
• Firstly, we collect lung CT images divided into three classes (Normal, Pneumonia, and COVID-19) and transform them into an 8-bit grayscale image format.
• Secondly, at the preprocessing stage, each gray-level lung CT scan is divided into four equal parts, and a group of neighboring pixels is used to extract a recognizable region of interest. A histogram stretch filter is employed to enhance the contrast and, for better visibility, the gray-level images are transformed into natural binary image format. At the postprocessing stage, we employ the novel segmentation approach called FARGS.
• After the segmentation process, statistical features are extracted from the abnormal region of the CT images.
• The chi-square feature reduction technique is deployed to obtain an optimized statistical feature dataset.
• Finally, five machine learning classifiers are deployed on the optimized statistical feature dataset.
Much research on the diagnosis of COVID-19 is currently underway. Many researchers have tried to find the best solution for the diagnosis of COVID-19 using various medical imaging modalities. The most popular methodologies are summarized in Tab. 1, together with the one proposed in this study for preliminary comparison.

Material and Methods
This study considers a dataset of lung disorders divided into three classes (Normal, Pneumonia, and COVID-19), which are determined from CT images as shown in Fig. 1. The dataset covers patients who are prone to the epidemic. The CT images were collected from two different sources: the first is the Radiology Department of Nishtar Hospital Multan and Civil Hospital Bahawalpur, Pakistan, and the second is a free, publicly available medical imaging database known as Radiopaedia (https://radiopaedia.org/). For each class, 400 patients were selected to examine their lung disorder using CT images of size 620 × 620, so a total of 1200 (400 × 3) CT images were acquired. An expert radiologist manually inspected all images on the basis of various medical tests and biopsy reports. Finally, with this expert advice, we developed the novel fuzzy c-mean automated region-growing segmentation technique.

Proposed Methodology
In this section, we briefly discuss the proposed methodology. In the first step, the whole image dataset is examined in a computer vision software library called OpenCV [19]. The second step is image preprocessing. Firstly, the digital CT images are transformed into an 8-bit grayscale format. Secondly, we divide each image into four equal segments and extract the exact part of the lung for observation. Thirdly, a histogram stretch is employed to normalize the non-uniformities. During CT image acquisition, speckle noise is detected due to the environmental conditions of the imaging sensor. To resolve this problem, the grayscale images are transformed into natural binary format, which improves contrast. The third step is segmentation, which helps to locate the exact position of the lesion and enhance its surface. This process is usually time-consuming because it relies on an expert radiologist. To resolve this problem, a novel fuzzy c-mean automated region-growing segmentation (FARGS) is applied to the preprocessed lung disorder CT image dataset. The fourth step is the hybrid statistical feature extraction. In this step, "texture" and "gray-level run-length matrix" (GLRLM) features are extracted from the CT image dataset. The fifth step is the hybrid statistical feature reduction, in which we select twelve optimized hybrid statistical features from the full extracted feature dataset using the chi-square feature reduction technique. The last step is classification, where five ML classifiers, namely "Deep Learning J4" (DLJ4), "Random Forest" (RF), "Support Vector Machine" (SVM), "Multilayer Perceptron" (MLP), and "Naive Bayes" (NB), are deployed on the optimized hybrid statistical feature dataset. A 10-fold cross-validation approach is used for the diagnosis of COVID-19, as shown in Fig. 2. We now discuss the proposed lung CT-scan segmentation algorithm for the diagnosis of COVID-19, with all its practical steps.
        Return to step (Update).
    }
    Extract the 52 hybrid statistical features.
    Select the 12 optimized hybrid statistical features using the chi-square approach.
End For
}
The Deep Learning J4 classifier is employed on the optimized hybrid statistical feature dataset.
Output = COVID-19.
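The preprocessing steps described in this section (8-bit grayscale conversion, division into four equal parts, histogram stretch, and binarization) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the study: images are represented as nested lists to keep the sketch dependency-free, the luma weights and the binarization threshold are assumptions, and all function names are hypothetical. In practice, OpenCV or NumPy arrays would be the natural choice.

```python
def to_gray8(rgb):
    """Convert rows of (r, g, b) tuples to 8-bit grayscale values."""
    # ITU-R BT.601 luma weights: an assumed, commonly used conversion.
    return [[min(255, round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
            for row in rgb]


def split_quadrants(img):
    """Split an image into four equal parts: top-left, top-right, bottom-left, bottom-right."""
    h, w = len(img) // 2, len(img[0]) // 2
    top, bottom = img[:h], img[h:2 * h]
    return ([row[:w] for row in top], [row[w:2 * w] for row in top],
            [row[:w] for row in bottom], [row[w:2 * w] for row in bottom])


def histogram_stretch(img):
    """Linearly rescale gray levels so that they span the full 0..255 range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [[0 for _ in row] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def binarize(img, threshold=128):
    """Map gray levels to {0, 1}; the threshold value is a hypothetical choice."""
    return [[1 if p >= threshold else 0 for p in row] for row in img]
```

A typical call chain would be `binarize(histogram_stretch(part))` for each `part` returned by `split_quadrants(to_gray8(rgb))`.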

Fuzzy c-mean Automated Region-growing Segmentation (FARGS)
There are several approaches to image segmentation. Most rely on expert opinion, which is a time-consuming process [20], whereas fuzzy c-mean automated region-growing segmentation is free from human-based expertise. At the preprocessing stage, each gray-level lung CT scan is divided into four equal parts, and a group of neighboring pixels is used to extract a recognizable region of interest. A histogram stretch filter is employed to enhance the contrast (for better visibility, the gray-level image is transformed into natural binary image format). Lastly, we use a fuzzy c-mean segmentation approach [21], which is mainly used for pattern classification. This segmentation approach divides the data into two segments. It is based on the following objective function (OF):
\[ J_q = \sum_{i=1}^{N} \sum_{j=1}^{C} \eta_{ij}^{q} \, \| y_i - \zeta_j \|^{2}, \tag{1} \]
where $q$ is a real number with $1 < q < \infty$, $N$ is the number of measured data points, $C$ is the number of clusters, $\eta_{ij}$ is the degree of membership of $y_i$ in cluster $j$, $y_i$ is the $i$th $d$-dimensional measured data point, $\zeta_j$ is the $d$-dimensional center of cluster $j$, and $\| \cdot \|$ is any norm expressing the similarity between a measured data point and the center. Fuzzy partitioning is performed by iterative optimization of the OF, with the membership $\eta_{ij}$ and the cluster centers $\zeta_j$ updated by
\[ \eta_{ij} = \frac{1}{\sum_{l=1}^{C} \left( \dfrac{\| y_i - \zeta_j \|}{\| y_i - \zeta_l \|} \right)^{2/(q-1)}}, \qquad \zeta_j = \frac{\sum_{i=1}^{N} \eta_{ij}^{q} \, y_i}{\sum_{i=1}^{N} \eta_{ij}^{q}}. \tag{2} \]
The iteration stops when the following condition holds:
\[ \max_{i,j} \left| \eta_{ij}^{(k+1)} - \eta_{ij}^{(k)} \right| < \varepsilon, \tag{3} \]
where $\varepsilon$ is a termination criterion between 0 and 1, and $k$ is the iteration step. This procedure converges to a local minimum or a saddle point of $J_q$. Finally, the FARGS approach is applied to the lung disorder dataset, as represented in Fig. 3.
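To make the fuzzy c-mean step concrete, the sketch below runs the membership and center updates described above on scalar pixel intensities, with two clusters. It is an illustration only and not the FARGS implementation: the region-growing part is omitted, the evenly spread initial centers are an assumption, and the function name `fuzzy_c_means` and its parameters are hypothetical.

```python
def fuzzy_c_means(values, n_clusters=2, q=2.0, eps=1e-5, max_iter=100):
    """Cluster scalar intensities with fuzzy c-means.

    Returns (centers, memberships), where memberships[i][j] is the degree of
    membership of values[i] in cluster j.
    """
    lo, hi = min(values), max(values)
    # Deterministic initialization (an assumption): centers spread over the range.
    centers = [lo + (hi - lo) * (j + 0.5) / n_clusters for j in range(n_clusters)]
    power = 2.0 / (q - 1.0)
    memberships = None
    for _ in range(max_iter):
        updated = []
        for y in values:
            dists = [abs(y - c) for c in centers]
            zeros = [d == 0 for d in dists]
            if any(zeros):
                # The point coincides with a center: share membership among such centers.
                row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
            else:
                row = [1.0 / sum((dj / dl) ** power for dl in dists) for dj in dists]
            updated.append(row)
        # Center update: mean of the data weighted by membership to the power q.
        new_centers = []
        for j in range(n_clusters):
            weights = [row[j] ** q for row in updated]
            total = sum(weights)
            new_centers.append(
                sum(w * y for w, y in zip(weights, values)) / total if total else centers[j])
        centers = new_centers
        # Stopping rule: largest change in any membership between two iterations.
        if memberships is None:
            change = float("inf")
        else:
            change = max(abs(a - b)
                         for old, new in zip(memberships, updated)
                         for a, b in zip(old, new))
        memberships = updated
        if change < eps:
            break
    return centers, memberships
```

Each pixel can then be assigned to the cluster in which its membership is largest, which yields the two segments.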

Feature Extraction
The OpenCV computer vision software library is used for the hybrid statistical feature extraction process, which covers texture and GLRLM features. These comprise 5 texture features and 8 GLRLM features, each computed in 4 directions (0, 45, 90, and 135 degrees), for a total of 52 (13 × 4) extracted features. The extracted dataset therefore has a large FVS size of 62,400 (1200 × 52) for the diagnosis of COVID-19.
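The five texture features are written here in their standard co-occurrence forms. The energy is defined by
\[ \text{Energy} = \sum_{c} \sum_{k} \rho_{ck}^{2}, \tag{4} \]
where $c$ and $k$ are the spatial coordinates and $\rho_{ck}$ is the gray-level value. The correlation is specified by
\[ \text{Correlation} = \frac{1}{\sigma_c \sigma_k} \sum_{c} \sum_{k} (c - \mu_c)(k - \mu_k) \, \rho_{ck}, \tag{5} \]
where $\mu_c$, $\mu_k$ and $\sigma_c$, $\sigma_k$ denote the means and standard deviations of the row and column marginals of $\rho_{ck}$. Also, the formula of the entropy is the following:
\[ \text{Entropy} = - \sum_{c} \sum_{k} \rho_{ck} \log_2 \rho_{ck}. \tag{6} \]
The inverse difference (IDE) can be defined as
\[ \text{IDE} = \sum_{c} \sum_{k} \frac{\rho_{ck}}{1 + |c - k|}. \tag{7} \]
Finally, the inertia is obtained as
\[ \text{Inertia} = \sum_{c} \sum_{k} (c - k)^{2} \, \rho_{ck}. \tag{8} \]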
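As an illustration of how such texture features can be computed, the following sketch builds a normalized gray-level co-occurrence matrix for one direction and derives the five features from it. It assumes the standard co-occurrence definitions of energy, correlation, entropy, inverse difference, and inertia, which may differ in detail from the exact variants used in the study. The names `glcm`, `texture_features`, and `DIRECTIONS` are hypothetical.

```python
import math

# Pixel offsets (dx, dy) for the four directions, with rows growing downward.
DIRECTIONS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}


def glcm(img, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for the pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < h and 0 <= c2 < w:
                counts[img[r][c]][img[r2][c2]] += 1
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("image too small for this offset")
    return [[n / total for n in row] for row in counts]


def texture_features(p):
    """Five texture features of a normalized co-occurrence matrix p."""
    idx = range(len(p))
    cells = [(i, j, p[i][j]) for i in idx for j in idx]
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sd_i = math.sqrt(sum((i - mu_i) ** 2 * v for i, _, v in cells))
    sd_j = math.sqrt(sum((j - mu_j) ** 2 * v for _, j, v in cells))
    cov = sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells)
    return {
        "energy": sum(v * v for _, _, v in cells),
        # Correlation is undefined for a constant marginal; 0.0 is returned by convention.
        "correlation": cov / (sd_i * sd_j) if sd_i > 0 and sd_j > 0 else 0.0,
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "inverse_difference": sum(v / (1 + abs(i - j)) for i, j, v in cells),
        "inertia": sum((i - j) ** 2 * v for i, j, v in cells),
    }
```

Calling `texture_features(glcm(img, levels, *DIRECTIONS[angle]))` for each of the four angles gives 5 × 4 = 20 texture values per image.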

Gray Level Run-Length Matrix (GLRLM)
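We now consider the gray-level run-length matrix (GLRLM) [24]. A gray-level run, also known as a run length, is a linear set of consecutive pixels with the same gray level in a particular direction. Let $\beta_g$ be the number of discrete intensity values in the image, $\beta_r$ be the number of discrete run lengths in the image, $\beta_p$ be the number of pixels in the image, $\beta_r(\vartheta)$ be the number of runs in the image along angle $\vartheta$, and $\psi(v_1, v_2 \mid \vartheta)$ be the run-length matrix for an arbitrary direction $\vartheta$, whose entry counts the runs of gray level $v_1$ and length $v_2$. Then, the gray level non-uniformity is described by
\[ \text{GLN} = \frac{\sum_{v_1=1}^{\beta_g} \left( \sum_{v_2=1}^{\beta_r} \psi(v_1, v_2 \mid \vartheta) \right)^{2}}{\beta_r(\vartheta)}. \tag{9} \]
Run length non-uniformity is defined in Eq. (10):
\[ \text{RLN} = \frac{\sum_{v_2=1}^{\beta_r} \left( \sum_{v_1=1}^{\beta_g} \psi(v_1, v_2 \mid \vartheta) \right)^{2}}{\beta_r(\vartheta)}. \tag{10} \]
Run length non-uniformity normalized is defined in Eq. (11):
\[ \text{RLNN} = \frac{\sum_{v_2=1}^{\beta_r} \left( \sum_{v_1=1}^{\beta_g} \psi(v_1, v_2 \mid \vartheta) \right)^{2}}{\beta_r(\vartheta)^{2}}. \tag{11} \]
Run percentage is shown in Eq. (12):
\[ \text{RP} = \frac{\beta_r(\vartheta)}{\beta_p}. \tag{12} \]
Low gray level run emphasis can be described as
\[ \text{LGLRE} = \frac{1}{\beta_r(\vartheta)} \sum_{v_1=1}^{\beta_g} \sum_{v_2=1}^{\beta_r} \frac{\psi(v_1, v_2 \mid \vartheta)}{v_1^{2}}. \tag{13} \]
High gray level run emphasis is described in Eq. (14):
\[ \text{HGLRE} = \frac{1}{\beta_r(\vartheta)} \sum_{v_1=1}^{\beta_g} \sum_{v_2=1}^{\beta_r} \psi(v_1, v_2 \mid \vartheta) \, v_1^{2}. \tag{14} \]
Gray level variance is given by
\[ \text{GLV} = \frac{1}{\beta_r(\vartheta)} \sum_{v_1=1}^{\beta_g} \sum_{v_2=1}^{\beta_r} \psi(v_1, v_2 \mid \vartheta) \, (v_1 - \mu_g)^{2}, \qquad \mu_g = \frac{1}{\beta_r(\vartheta)} \sum_{v_1=1}^{\beta_g} \sum_{v_2=1}^{\beta_r} \psi(v_1, v_2 \mid \vartheta) \, v_1. \tag{15} \]
Finally, run length variance is presented in Eq. (16):
\[ \text{RLV} = \frac{1}{\beta_r(\vartheta)} \sum_{v_1=1}^{\beta_g} \sum_{v_2=1}^{\beta_r} \psi(v_1, v_2 \mid \vartheta) \, (v_2 - \mu_r)^{2}, \qquad \mu_r = \frac{1}{\beta_r(\vartheta)} \sum_{v_1=1}^{\beta_g} \sum_{v_2=1}^{\beta_r} \psi(v_1, v_2 \mid \vartheta) \, v_2. \tag{16} \]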
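The run-length matrix itself is simple to build. The sketch below does so for the 0-degree (horizontal) direction and computes four of the features named above: gray level non-uniformity, run length non-uniformity, its normalized version, and run percentage. It assumes the usual definitions of these quantities, uses gray levels indexed from 0, and the function names are hypothetical. The other directions would scan columns or diagonals in the same way.

```python
def run_length_matrix(img, levels):
    """Run-length matrix along 0 degrees.

    P[g][r - 1] counts the runs of gray level g that have length r.
    """
    max_run = len(img[0])
    P = [[0] * max_run for _ in range(levels)]
    for row in img:
        value, length = row[0], 1
        for pixel in row[1:]:
            if pixel == value:
                length += 1
            else:
                P[value][length - 1] += 1
                value, length = pixel, 1
        P[value][length - 1] += 1  # close the last run of the row
    return P


def glrlm_features(P, n_pixels):
    """A subset of GLRLM features computed from the run-length matrix P."""
    n_runs = sum(map(sum, P))
    gray_totals = [sum(row) for row in P]          # number of runs per gray level
    length_totals = [sum(col) for col in zip(*P)]  # number of runs per run length
    return {
        "gray_level_non_uniformity": sum(t * t for t in gray_totals) / n_runs,
        "run_length_non_uniformity": sum(t * t for t in length_totals) / n_runs,
        "run_length_non_uniformity_normalized":
            sum(t * t for t in length_totals) / n_runs ** 2,
        "run_percentage": n_runs / n_pixels,
    }
```

For a 2 × 4 image, `glrlm_features(run_length_matrix(img, levels), 8)` returns the four values as a dictionary.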

Feature Reduction
In feature reduction, the extracted features are mapped to a lower-dimensional space. Despite the lower dimension, the original data structure is retained as much as possible [25]. The low-dimensional feature space also reduces the time and cost of execution, and the results obtained are almost comparable to those of the original feature space. Feature selection (FS) [26] is the process applied when a large number of features have been extracted. Its main objective is to select the most important features. Usually, a large dataset is needed to handle a large number of features, which is not an easy task. It is therefore important to minimize the dimension of the feature vector space while still differentiating and classifying the different classes effectively. These techniques have been implemented to obtain highly discriminant features. Finally, the most discriminant features are used to achieve cost-effective classification accuracy. A common way to select features in a statistical dataset is the chi-square feature reduction [27]. The mathematical foundation of the chi-square feature reduction is given by
\[ \chi^{2}(i, j) = \sum_{\gamma_i \in \{0, 1\}} \sum_{\gamma_j \in \{0, 1\}} \frac{\left( N_{\gamma_i \gamma_j} - E_{\gamma_i \gamma_j} \right)^{2}}{E_{\gamma_i \gamma_j}}, \tag{17} \]
where $N$ is the observed frequency and $E$ is the expected frequency; $\gamma_i$ is 1 if the document contains the term $i$ and zero otherwise, and $\gamma_j$ is 1 if the document is in class $j$ and zero otherwise. In this study, we select the most discriminant features for the COVID-19 classification. The proposed chi-square approach selects 12 optimized features out of the 52 features.
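The selection step can be illustrated with a small chi-square scorer. The sketch below uses one common formulation for non-negative numeric features: for each feature, the observed per-class sums are compared with the sums expected if the feature were independent of the class. This is an assumption about how the scoring could be done, not the authors' code, and the names `chi_square_scores` and `select_k_best` are hypothetical. In the setting of this study, `k` would be 12 and each row of `X` would hold the 52 extracted features of one image.

```python
def chi_square_scores(X, y):
    """Chi-square score of each non-negative feature against the class labels.

    X is a list of feature rows and y the list of class labels.
    """
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    class_prob = {c: y.count(c) / n_samples for c in classes}
    scores = []
    for f in range(n_features):
        total = sum(row[f] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[f] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * total  # value under independence
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_k_best(X, y, k):
    """Indices of the k features with the highest chi-square scores."""
    scores = chi_square_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return sorted(ranked[:k])
```

A feature whose values are distributed identically across classes scores 0 and is discarded first.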

Classification
In this research, five ML classifiers, namely DLJ4, MLP, SVM, RF, and NB, are deployed on the optimized hybrid statistical feature dataset using 10-fold cross-validation for the diagnosis of COVID-19. We observe that the DLJ4 classifier performs well compared with the other implemented classifiers. We attribute this performance to the complexity of the data, an aspect that DLJ4 generally handles well. The mathematical foundation of the DLJ4 classifier [28] is described below. The products of the inputs and weights, together with the bias, are summed using the summation function ($\sigma_n$), specified as
\[ \sigma_n = \sum_{l=1}^{c} \lambda_{ln} J_l + \mu_n. \tag{18} \]
Here, $c$ is the number of inputs, $J_l$ is the input variable $l$, $\mu_n$ is the bias term, and $\lambda_{ln}$ is the weight. Many activation functions are available for DLJ4; a common one is the sigmoid function, given as
\[ f(\sigma_n) = \frac{1}{1 + e^{-\sigma_n}}. \tag{19} \]
The output of neuron $n$ can then be obtained as
\[ y_n = f\!\left( \sum_{l=1}^{c} \lambda_{ln} J_l + \mu_n \right). \tag{20} \]
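The computation performed by a single neuron, and by a layer of such neurons, can be sketched in a few lines. The sigmoid activation is an assumption made for illustration, since the text allows many activation functions, and the function names are hypothetical. This is not the DLJ4 implementation itself.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through a sigmoid activation."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias  # summation function
    return 1.0 / (1.0 + math.exp(-s))                       # assumed activation


def layer_forward(inputs, weight_rows, biases):
    """Outputs of one fully connected layer: one weight row and one bias per neuron."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]
```

Stacking several calls to `layer_forward`, each fed with the outputs of the previous one, gives the multi-layer structure on which DL classifiers rely.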

Results and Discussion
The overall classification accuracy obtained by the deployed ML classifiers on the optimized hybrid statistical features of lung disorders is reported together with other performance evaluation factors. These include the "Kappa statistic", a metric that compares the observed accuracy with the accuracy expected by chance; the "True positive" (TP), a result where the model correctly predicts the positive class; the "False positive" (FP), a result where the model wrongly predicts the positive class; and the "Precision", the fraction of instances predicted as positive that are actually positive, given in Eq. (21):
\[ \text{Precision} = \frac{TP}{TP + FP}. \tag{21} \]
The "Recall" is the fraction of the relevant (actually positive) instances that are retrieved, given by
\[ \text{Recall} = \frac{TP}{TP + FN}, \tag{22} \]
where FN denotes the false negatives, that is, positive instances that the model wrongly predicts as negative. The "F-measure" is computed from the precision and recall, as given in Eq. (23):
\[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{23} \]
The "Receiver-operating characteristic" (ROC) is a graphical plot of the TP rate against the FP rate at different decision thresholds. "Mean absolute error" (MAE) is a quantity used to measure how close the predictions are to the final result. "Root mean squared error" (RMSE) measures the magnitude of the deviations between the predicted values and the observed values. Lastly, the time complexity (T) is shown in Tab. 3. The considered classifiers, that is, DLJ4, MLP, SVM, RF, and NB, achieve very high accuracies of 98.67%, 98.00%, 97.33%, 96.67%, and 96%, respectively, in the ML-based diagnosis of COVID-19, as indicated in Fig. 4. Correspondingly, the confusion matrix (CM) of the optimized statistical features is shown in Tab. 4. The diagonal entries of the CM correspond to the instances classified into their correct classes, while the off-diagonal entries are instances assigned to other classes. The CM contains the actual and predicted class information for the DLJ4 classifier. Hence, the DLJ4 classifier shows better overall accuracy than the other implemented classifiers.
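The evaluation factors discussed in this section can all be derived from a confusion matrix. The sketch below computes the overall accuracy, the Kappa statistic, and the per-class precision, recall, and F-measure, using their standard definitions. The function name is hypothetical, and any matrix passed to it in an example is made-up data, not the values of Tab. 4.

```python
def classification_metrics(cm):
    """Metrics from a confusion matrix.

    cm[i][j] is the number of instances of actual class i predicted as class j.
    """
    n = len(cm)
    total = sum(map(sum, cm))
    row_sums = [sum(row) for row in cm]        # actual instances per class
    col_sums = [sum(col) for col in zip(*cm)]  # predicted instances per class
    accuracy = sum(cm[i][i] for i in range(n)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        precision = tp / col_sums[i] if col_sums[i] else 0.0
        recall = tp / row_sums[i] if row_sums[i] else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        per_class.append(
            {"precision": precision, "recall": recall, "f_measure": f_measure})
    return {"accuracy": accuracy, "kappa": kappa, "per_class": per_class}
```

With three classes ordered as Normal, Pneumonia, and COVID-19, `per_class[2]` would hold the scores for the COVID-19 class.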

Exponentiated transformed sine G Family
We now propose a complementary study providing a distributional approach to fit modern datasets such as those derived from COVID-19. Recently, several generalized families (G) of continuous distributions have been proposed. They are based on the following principle: make a parent distribution more flexible by transforming the corresponding cumulative distribution function (CDF), involving one or more new parameters. Here, we define a new G family by the following CDF and probability density function (PDF), respectively:

Application of ETSEx Distribution on COVID-19 Datasets
We now apply the ETSEx model to fit the data on COVID-19 confirmed cases (I), recovered cases (II), and non-recovered cases (III) in Pakistan from 24 March 2020 to 1 May 2020. This period corresponds to the so-called "first wave." We assume that the considered variable is continuous, which is acceptable since a wide range of values is observed, and we provide a new statistical model that can be useful for the following purposes: (i) making predictions for a pandemic with similar features and under similar conditions (comparable populations, comparable ecosystems…), (ii) proposing an efficient model for fitting COVID-19 data in other countries, and (iii) comparing the evolution of the COVID-19 disease in Pakistan with that in other countries. The dataset is obtained from the public database of the COVID-19 health advisory platform of the Ministry of National Health Services Regulations & Coordination (http://covid.gov.pk/stats/pakistan). We compare the fit of the ETSEx model with that of the standard exponential (Ex) model [29]. As a first analysis, descriptive statistics are given in Tab. 5. The model parameters are estimated via the maximum likelihood method (with the so-called BFGS algorithm), and the R software [30] is used for all the computations. The maximum likelihood estimates (MLEs) and the corresponding standard errors (SEs) for all the model parameters are given in Tab. 6. The results in Tab. 6 are clear: having the smallest values of −ρ, AIC, BIC, W*, A*, and KS, and the greatest KS p-value, the ETSEx model fits better than the exponential distribution. Fig. 6 also supports this claim. The results of the fit are thus in favor of the ETSEx model. This motivates its use for similar analyses in other countries, modestly hoping that pandemic specialists can take advantage of this model.
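To show how such goodness-of-fit criteria are obtained, the sketch below fits the baseline exponential (Ex) model by maximum likelihood and computes its log-likelihood, AIC, BIC, and Kolmogorov-Smirnov (KS) statistic. It covers only the one-parameter exponential model, whose MLE has a closed form. The ETSEx model would require numerical optimization (BFGS in the study) and is not reproduced here. The study used R; Python is used here for illustration only, and the function names are hypothetical.

```python
import math


def fit_exponential(data):
    """Closed-form MLE of the exponential rate, with log-likelihood, AIC, and BIC."""
    n, total = len(data), sum(data)
    rate = n / total                         # MLE: reciprocal of the sample mean
    loglik = n * math.log(rate) - rate * total
    k = 1                                    # number of estimated parameters
    aic = 2 * k - 2 * loglik
    bic = k * math.log(n) - 2 * loglik
    return rate, loglik, aic, bic


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of the data and a fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))


def exponential_cdf(rate):
    """CDF of the exponential distribution with the given rate."""
    return lambda x: 1.0 - math.exp(-rate * x)
```

Smaller AIC, BIC, and KS values indicate a better fit, which is the basis on which two candidate models fitted to the same data are compared.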

Conclusions
The main aim of this research is the automated segmentation of lung CT images for the diagnosis of COVID-19 using machine learning methods. For this purpose, we collect a CT image dataset of lung disorders and divide it into three classes (Normal, Pneumonia, and COVID-19). The CT image dataset is collected from two different sources. The first source is the Radiology Department of Nishtar Hospital Multan and Civil Hospital Bahawalpur, Pakistan. The second source is a free, publicly available medical imaging database known as Radiopaedia. At the preprocessing stage, the CT images are transformed into an 8-bit grayscale format, each image is divided into four equal segments, and the exact part of the lung is extracted for observation. For automated segmentation, a novel fuzzy c-mean automated region-growing segmentation (FARGS) is employed. After that, hybrid statistical features are extracted from the segmented region. The chi-square feature reduction technique is employed to optimize the dataset. Lastly, the considered ML classifiers, that is, DLJ4, MLP, SVM, RF, and NB, achieve very high accuracies of 98.67%, 98.00%, 97.33%, 96.67%, and 96%, respectively. DLJ4 shows very promising accuracy compared with the other employed classifiers. The article ends with some contributions in statistical modeling of important COVID-19 data, which can be of independent interest. This novel research aims to help radiologists with the automated segmentation of lung CT images and the early diagnosis of COVID-19.