Deep Stacked Ensemble Learning Model for COVID-19 Classification

Abstract: COVID-19 is a growing problem worldwide with a high mortality rate, and the World Health Organization (WHO) has declared it a pandemic. Limiting the spread of the disease requires a fast and accurate diagnosis. A reverse transcription polymerase chain reaction (RT-PCR) test is often used to detect the disease. However, since this test is time-consuming, a chest computed tomography (CT) scan or plain chest X-ray (CXR) is sometimes indicated. Automated diagnosis saves time and money by minimizing human effort. Our research makes three contributions. First, a finetuning methodology is used to test the behavior and efficiency of a variety of vision models, ranging from Inception to Neural Architecture Search (NAS) networks. Second, the behavior of these models is visually analyzed by plotting class activation maps (CAMs) for individual networks and by assessing classification efficiency with AUC-ROC curves. Finally, stacked ensemble techniques are used to provide greater generalization by combining the finetuned models into six ensemble neural networks. Using stacked ensembles, the generalization of the models improved. Furthermore, the ensemble model created by combining all of the finetuned networks obtained a state-of-the-art COVID-19 detection accuracy of 99.17%. The precision and recall rates were 99.99% and 89.79%, respectively, highlighting the robustness of stacked ensembles. According to the experimental results, the proposed ensemble approach performed well in classifying COVID-19 lesions on CXR.


Introduction
The coronavirus disease (COVID-19) was first noted in December 2019 in Wuhan City (Hubei, China). The viral infection quickly spread worldwide, eventually causing a global pandemic. Following a detailed study of its biological properties, the virus was found to be of zoonotic origin and to consist of a single-stranded ribonucleic acid (RNA) genome with a strong capsid. Based on this survey, it was concluded that the virus belongs to the Coronaviridae family, and it was subsequently named 2019 novel coronavirus (2019-nCoV). A person infected with 2019-nCoV may have no symptoms or may develop mild symptoms, including sore throat, dry cough, and fever. If the human body hosts 2019-nCoV for a long period, the virus can cause severe respiratory illness and, in the worst case, death. Four stages are used to assess the virulence of the virus in the human body. During the first four days of the infection, the patient is often asymptomatic. The second stage is the progressive stage, which generally occurs between the fifth and eighth day following the infection, whereby the patient may develop mild symptoms. Stage three is known as the peak stage, which occurs between nine and thirteen days. The final stage is the absorption stage, whereby the load of the virus exponentially increases [1]. These observations were reported with clinical experimentation in Fig. 1 [2].

Figure 1: An upsurge in the number of cases and death rate from January to July 2020 is depicted. The infection and death rate increased by approximately 10^5 within six months.
Due to the rapid surge in cases, healthcare systems are finding it increasingly difficult to cope with the demand and to provide timely vaccination [3]. This problem is being further exacerbated by the global shortage of medical supplies. In order to reduce the burden on healthcare systems, several preventive measures such as social distancing, proper sanitization, the mandatory wearing of masks in public places, and lockdowns have been implemented worldwide to reduce the spread. Despite the implementation of all these measures, the mortality rate from the disease is still high in various countries. According to the Chinese National Health Commission (NHC), as of February 4th, 2020, the mortality rate from the disease was 2.1% in China and 0.2% outside of China. The mode of spread of the virus in asymptomatic cases remains controversial [4,5]. In order to identify COVID-19 in an asymptomatic person, precise and proper diagnostic tests are required. The diagnostic tests are typically performed by collecting samples from the individual patient for testing in a laboratory or at a point-of-care testing center [6]. Manual testing is time-consuming and labor-intensive. Therefore, this method is not suitable for obtaining a fast diagnosis during a pandemic. Computed tomography (CT) and chest X-ray (CXR) can be used to detect and assess the severity of the lung damage caused by the viral infection. However, a radiologist needs to analyze these images manually, which is time-consuming. Artificial intelligence (AI) can be used to develop algorithms that automatically assess the lung damage caused by the virus [2,7]. The findings for the COVID-19 infection in CXR or chest CT vary from person to person.
However, two common hallmark imaging features observed in infected patients were bilateral and peripheral ground-glass opacities and peripheral lesions with a rounded morphology [2]. These distinct features facilitate the use of machine vision learning models to automatically detect COVID-19 lesions on either CXR or CT images. However, traditional methods do not preserve the contextual information of CT scan images. In view of this, this study aimed to develop a robust diagnostic model for COVID-19 detection on CXR images. The objectives of this study were to:
• analyze the behavior and performance of various vision models, ranging from Inception to Neural Architecture Search (NAS) networks, followed by appropriate model finetuning,
• visually assess the behavior of these models by plotting class activation maps (CAMs) for individual networks,
• determine the classification performance of the models by calculating the area under the curve (AUC) of the receiver operating characteristic (ROC) curve,
• improve the generalization of the models by combining the finetuned deep learning models using the stacked ensembles technique.

Previous Works
Numerous studies have evaluated deep learning methods for automatic detection, classification, feature extraction, and segmentation in COVID-19 diagnosis from CXR and CT images. This section discusses the relevant applications of pre-trained deep neural networks and the key aspects that affect COVID-19 detection and classification. Fan et al. [8] proposed the use of the deep learning network Inf-Net for the segmentation of COVID-19 lesions on transverse CT scan images. This network architecture utilized Res2Net as a backbone and obtained a dice score of 0.682. A similar semi-Inf-Net model attained a higher dice score of 0.739. Oh et al. [9] implemented two different approaches, global patch matching and local patch matching, for segmentation and classification. Their method used ResNet-18 as the backbone to classify four different types of lung infections similar to that of COVID-19. Their algorithm obtained an accuracy score of 88.9% and a specificity of 0.946 on randomly cropped patches using a local approach. Rahimzadeh et al. [10] constructed an 8-phase training scheme concatenating the Xception and ResNet-50 architectures. In each phase, the model was trained for 100 epochs using properly stratified samples to overcome class imbalance. This model attained an overall accuracy score of 91.4% with five-fold cross-validation. Ozturk et al. [11] proposed a Dark-CovidNet model for binary and tri-class classification of CXR images infected with COVID-19. This model was built as a deep neural architecture with a series of convolutional layers and max-pooling layers. This method attained accuracy scores of 98.3% for the binary classification and 87.2% for the tri-class classification with five-fold cross-validation. Apostolopoulos et al. [12] applied transfer learning using diverse pre-trained architectures on two different datasets for the classification of COVID-19 CXR images.
Their transfer learning methodology attained an accuracy score of 98.75% using VGG-19 pre-trained weights for binary classification and an accuracy of 94.7% using MobileNet-V2 for the classification of CXR images into three classes. Li et al. [13] proposed the CovNet network by training a deep learning model with ResNet-50 as a backbone for sharing weights and attained an accuracy of 96%. Khan et al. [14] designed the CoroNet architecture with Xception as the underlying weight-sharing model. This model achieved an accuracy score of 99% for binary classification, 95% when using three non-identical classes (one class belonging to , and 89.6% for four variant classes following a four-fold cross-validation framework. Wang et al. [15] proposed the use of COPEL-Net to segment COVID-19 pneumonia lesions from CT images. A novel dice loss combined with an MAE loss was used for generalization, to reduce noise and to minimize the foreground and background imbalance in the segmentation task. This diagnostic framework obtained a dice score of 80.72 ± 9.96. Most COVID-19 classification and segmentation methods for CXR and CT images described in the literature are based on deep neural networks. The advantage of deep neural networks is that they provide a versatile weight-sharing mechanism, thus improving the performance of the algorithm. Therefore, this study aimed to develop a robust diagnostic COVID-19 model using CXR images. The objectives of the study were to:
• examine the behavior and efficiency of different deep learning vision models, ranging from Inception to NAS networks, using a proper finetuning procedure,
• visually assess the behavior of these models by plotting class activation maps (CAMs) for individual networks.

Dataset Description
A total of 2905 CXRs were obtained from various databases, including the Italian Society of Medical Radiology (SIRM), ScienceDirect, The New England Journal of Medicine (NEJM), Radiological Society of North America (RSNA), Radiopaedia, Springer, Wiley, Medrxiv, and other sources (Fig. 2). The complete source list of the COVID-19 CXR image samples is available in the metadata file [16]. These images were reviewed by an expert radiologist. Eight percent (n = 219) of these images were from patients infected with COVID-19, 46% (n = 1341) of the images were from healthy persons, and the rest of the images were from patients suffering from either bacterial or viral pneumonia (n = 1345) [16]. The data was then divided into a training dataset (D_train, 75%) and a testing dataset (D_test, 25%). Due to the small number of CXRs with COVID-19 lesions, stratified random sampling was used to ensure that all three diagnoses were represented in the same proportions in both the training and testing datasets, and hence to minimize the risk of distorting the class distribution between the two sets.
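The stratified 75/25 split described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function name `stratified_split`, the seed, and the rounding rule are assumptions, while the class counts are those reported for the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices so each class keeps the same test fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random sampling within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class counts as reported for the dataset.
labels = ["covid"] * 219 + ["normal"] * 1341 + ["pneumonia"] * 1345
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the split is done per class, the minority COVID-19 class cannot end up under-represented in D_test by chance, which is the point of stratification when one class has only a few hundred samples.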

Convolutional Neural Networks for Feature Extraction
Convolutional neural networks (CNNs) are increasingly being used in computer vision to detect, classify, localize, and segment normal and pathological features from medical images [17]. The use of CNNs grew rapidly following their application in large-scale image recognition challenges (ILSVRC-2010). In this challenge, AlexNet [18] used a deep CNN and achieved the lowest error rate. This motivated researchers to use this technology to develop multidisciplinary high-end applications [19]. The CNN architecture can be modified significantly by manipulating the width, depth, and channels (activation maps) to further improve the performance of the model with appropriate generalization. Furthermore, the model's performance can be improved by sharing parametric weights from one network to another. This technique facilitates the feature extraction procedure in most networks, eventually reducing the computational and training cost [20,21]. Following the successful implementation of AlexNet, numerous other CNNs were developed. In the following sections, the advantages and limitations of each CNN are discussed.

Inception
The Inception architecture is built around a novel module design. The network widens its layers to increase the effective depth of the network with few additional computational parameters. There are two versions of the module: a naive version and a dimensionality-reduced version. The Inception module consists of three levels. The bottom level of the module feeds into four different layers stacked by width. The intermediate layers extract spatial information individually and correlate it across layers. The top layer concatenates the feature maps of all the intermediate layers to maintain a hierarchy of features and improve the perceived performance of the network [22].

VGG-Nets
After Inception, VGG networks were developed by stacking sequential convolutional layers with pooling layers. The sequential depth of the models ranges from 11 to 19 layers. The appropriate use of max-pooling layers in the 16- and 19-layer VGG-Nets is essential for spatial sub-sampling and the extraction of generic features at the rearmost layers. VGG-Nets use small receptive fields (3x3, with 1x1 in some configurations) to capture small features, eventually improving their detection precision. The generalizability of the model for highly correlated inputs can be further improved by finetuning the learning schedule to decrease the learning rate [23].

Res-Nets
The Res-Nets were developed to address the problem of vanishing gradients by introducing identity mappings in large-scale networks. They reformulate deep layers by adding the learned activations from a prior layer to form a residual connection. This residual learning minimizes the problem of degrading and exploding gradients in deeper networks. These residual connections carry learned activations forward from preceding layers, maintain a constant information flow throughout the network, and eventually reduce the computational cost [24][25][26].

Inception-Res-Nets
This network was inspired by the Inception network modules and identity mappings from ResNets. This method integrates dimensionality-reduced Inception modules with sequential residual connections hence increasing the learning capability of the network while reducing its computational cost. This provides better generalization ability when compared to various versions of the ResNet and Inception Networks [25].

Xception
This network was proposed to compete with the Inception network and to reduce its flaws. The mapping of spatial and cross-channel correlations allows for improved learning with small receptive fields and improves perceptive ability. The depth-wise separable convolutional layers enhance learning through detailed feature extraction. These networks are computationally less expensive and perform better than the Inception network [27].
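The lower computational cost of depth-wise separable convolutions can be illustrated by counting weights. The sketch below compares a standard convolution with a depth-wise convolution followed by a 1x1 point-wise convolution; the layer sizes are hypothetical, and bias terms are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depth-wise conv plus a 1x1 point-wise conv (no bias)."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 conv mixing the channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 128 input channels, 256 output channels.
standard = conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(standard, separable, round(standard / separable, 2))
```

For this hypothetical layer the separable form needs roughly one ninth of the weights, and the saving grows with the kernel size and the number of output channels.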

Dense-Nets
These densely connected CNNs are motivated by the residual connections of Res-Nets and impose long-chained residual connections to form dense blocks. In Dense-Nets, for N layers, there are N(N+1)/2 connections (including residual connections), which enhance the network's capability for extracting detailed features while reducing image degradation. The sequential dense and transition blocks provide a collection of knowledge and a bottleneck receptive field of 3x3, eventually improving computational efficiency. The finetuning of larger weights improves generalization in deeper networks, with depths ranging from 121 to 201 layers [28].

Mobile-Nets
Mobile-Nets were designed for mobile applications under constrained environments. The main advantage of this network is the combination of inverted residual layers with linear bottlenecks. The network takes a low-dimensional input, which is then expanded into a higher-dimensional space. These expanded features are filtered via depth-wise separable convolutions and are then projected back onto a low-dimensional space using linear convolutions. This design reduces the need to access the main memory of the mobile device, thus providing faster execution through the use of cache memory [29].

Nas-Nets
Nas-Nets make use of convolution cells by learning from distinct classification tasks. The design of this network is based on a reduced depth-wise stacking of normal cells, hence providing an appropriate search space by decoupling a sophisticated architectural design. This adaptability of Nas-Nets enables it to perform well even on mobile applications. The computational cost is significantly reduced, and its performance can be improved by enhancing the depth [30].

Deep Stacked Ensemble Method
This deep stacked ensemble method was evaluated by classifying COVID-19 database inputs into a tri-class and a binary class, as shown in Fig. 3. Samples from the COVID-19 dataset were first pre-processed to a resolution of 224×224×3. These pre-processed images were then fed into a variety of deep networks that use different paradigms to extract features from latent dimensions. The extracted feature vectors are then evaluated, and the two best-performing models are selected to form a stacked ensemble. The COVID-19 class is given more weight in this ensemble, which was assessed by classifying the inputs into a tri-class and a binary class.

Finetuning of Neural Networks
Deep learning algorithms can detect pathology in bio-medical imaging with human-level precision. CNNs provide numerous advantages for feature detection in medical imaging. There are two methods that can be used to design neural architectures for medical imaging. The first method involves designing a novel architecture, or overhauling an existing one, and training it end-to-end. The second method involves model finetuning, either by transferring the weights of a pre-trained model (transfer of weights) or by retraining an existing pre-trained architecture.
The training of an end-to-end CNN requires proper initializations and can be computationally expensive. On the other hand, transferring weights from models pre-trained on a similar problem can reduce the computational cost. However, such models may not extract the invariances if the class samples in the problem statement have never been trained on. For example, a network pre-trained on ImageNet may not be able to extract the invariances in CXRs if it has never seen such samples. This means that the model may end up capturing unwanted features on the CXR, leading to an inaccurate classification. In order to overcome this problem, the model is finetuned to obtain the appropriate features. Finetuning of the model is extremely important in medical imaging when the sample size is small, leading to class imbalance [31]. Hence, the existing models, from VGG-Nets to Dense-Nets, were all finetuned to extract invariant features and discriminate the COVID-19 class from the remaining classes. The finetuning of the individual models was performed as per Algorithm 1. The major parameters considered for finetuning in our methodology were the learning schedule and the batch size. The models were finetuned by constraining the noise arising during the training process, which reduces the risk of misleading the model when it is not trained with appropriate initializations.
The D_train and D_test samples were passed through each model to capture latent feature vectors. A feedforward neural network was built to classify the extracted feature vectors, and all models were finetuned using Algorithm 1. The final extracted features had different three-dimensional shapes depending on the model. These latent representations were then classified by attaching a dense layer consisting of 256 neurons, followed by dropout [32] and batch normalization [33] for regularization. The final layer consisted of a softmax activation layer with "c" neurons, whereby "c" represents the number of classes. The dropout percentage was set to 30%. A generalization assessment was performed for all individual models. ReLU [34] was used as the non-linearity for all the layers except the final layer, which used softmax. Glorot-normal was used for the initialization of most of the layers [35]. The initializations with appropriate activations resulted in the extraction of intricate, deep features.
All models were carefully finetuned, and their performance was evaluated using various performance metrics. The generalizations provided by the finetuned models are summarized in Tab. 1. All the models performed well and had a similar overall performance (Tab. 2). The classwise performance of the models is also summarized in Tab. 2. The classes are coded as C-0, C-1, and C-2, indicating COVID-19, normal, and pneumonia, respectively.
In the design of medical diagnostic prediction models, receiver operating characteristic (ROC) analysis is essential for analyzing the model performance. The area under the curve (AUC) of the ROC of a classifier determines the diagnostic stability of the model. The AUC of the ROC curve is insensitive to alterations in the individual class distributions [36]. A ROC curve for each model was therefore plotted, as shown in Fig. 4. The feature extraction ability of the models varied widely, as not all models were capable of recognizing features pertaining to COVID-19 lesions.
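As an illustration of the AUC described above, the sketch below computes a one-vs-rest AUC from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The scores and labels are made up for the example, and `auc_score` is a hypothetical helper, not part of the study's code.

```python
def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up COVID-19-vs-rest scores: 1 = COVID-19, 0 = any other class.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = auc_score(scores, labels)

# Duplicating the negatives changes the class balance but not the AUC.
auc_imbalanced = auc_score(scores + [0.6, 0.2, 0.1], labels + [0, 0, 0])
print(auc, auc_imbalanced)
```

The second call shows the insensitivity to class distribution in a simple case: replicating the negative samples leaves the pairwise ranking, and hence the AUC, unchanged.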
A prediction model for medical imaging needs to have high sensitivity and specificity. A clinically useful COVID-19 model based on CXR needs to be able to differentiate COVID-19 from other infections. However, distinguishing CXR lesions caused by COVID-19 from those caused by other infections can be quite challenging. CAMs were therefore applied to all CXR input images [37]. CAMs apply global average pooling to the bottleneck activations in CNNs and provide a visual understanding of the discriminative image regions and/or the region of interest. Through heat maps, CAMs visually illustrate the features extracted by the models to make predictions. Therefore, CAMs provide a clear understanding of whether the acquired features are distinctive of a COVID-19 lesion, as illustrated in Fig. 5.
The CAM analysis shows that some of the models extracted the peripheral and bilateral ground-glass opacities, while other models also extracted the rounded morphology typical of COVID-19 lesions [38]. Since both features were deemed essential for an accurate diagnosis, the models that provided the highest generalization and extracted different features according to the CAM analysis were used to develop the neural model averaging and neural stacked ensemble models.
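The CAM computation can be sketched as a weighted sum of the final convolutional feature maps, using the weights that connect the globally average-pooled features to the class of interest. The example below is a toy illustration with two 2x2 feature maps; the values, weights, and function names are hypothetical, and upsampling the map to the input resolution is omitted.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum over channels: cam[i][j] = sum_k w_k * A_k[i][j]."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    return cam


def normalize(cam):
    """Rescale a map to [0, 1] so it can be rendered as a heat map."""
    flat = [v for row in cam for v in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0  # avoid division by zero on flat maps
    return [[(v - low) / span for v in row] for row in cam]


# Toy bottleneck activations (2 channels, 2x2) and weights for one class.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
class_weights = [0.5, 1.0]

cam = class_activation_map(feature_maps, class_weights)
heat = normalize(cam)
print(cam, heat)
```

With global average pooling, the class logit (before any bias) equals the spatial mean of the unnormalized map, which is why high-valued regions of the map can be read as the regions that drove the prediction.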

Model Averaging
Model averaging is the process of averaging the outcomes of a group of networks trained on a similar task, or of the same model trained with different parameters. Model averaging improves the generalization of the models by aggregating their predictions. The generalization for the model was obtained by minimizing the loss during stochastic optimization, whereby x and y are the features and ground-truth class labels of a particular data distribution. If f_n is the n-th neural architecture that predicts the class label for a given feature set (where n = 1, 2, ..., N), the mean squared error loss is minimized as in Eq. (1), where f_N represents the final neural architecture.
Similarly, weights can be assigned to individual models based on their prediction performance. These weights are then applied to the appropriate models to obtain aggregated generalization. This is known as weighted model averaging. In plain model averaging, all the models are treated equally, with the same weight assigned to each network. In contrast, weighted model averaging gives more importance to the better-performing models and discounts the poorly performing ones.
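The difference between plain and weighted model averaging can be sketched as follows. The softmax outputs and the weights are made-up values for a single image, and using validation accuracies as weights is an assumption made for illustration.

```python
def average_predictions(prob_lists, weights=None):
    """Combine per-model class probabilities by a (weighted) mean."""
    if weights is None:
        weights = [1.0] * len(prob_lists)  # plain model averaging
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]


# Made-up softmax outputs of three models for classes C-0, C-1, C-2.
predictions = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
]

plain = average_predictions(predictions)
# Hypothetical per-model weights, e.g. validation accuracies.
weighted = average_predictions(predictions, weights=[0.99, 0.90, 0.98])
print(plain, weighted)
```

Both variants use fixed combination rules; the stacked ensembles discussed next instead learn how to combine the predictions.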
The generalization provided by the committee of the neural models improves when compared to that of model averaging and weighted model averaging. Hence, the models were stacked to improve the generalization ability of the model.

Stacked Ensembles
The stacked ensemble integrates or groups different models to provide aggregated generalization by mapping the output predictions onto a logit function. Instead of assigning averaging weights to the grouped models, logistic regression or a multi-class logit is applied to map the predictions. The predictions are gathered, and either a logistic regression is fitted to them or an end-to-end neural model is built on top of them that applies a softmax non-linearity as the final activation [39,40]. The generalization improvement provided by the stacked ensemble (using neural networks) can be described mathematically as follows.
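A minimal sketch of stacking with a multi-class logit is given below. The class probabilities of two base models are concatenated into one feature vector per sample, and a softmax regression is fitted on top of them by stochastic gradient descent. All data values, hyperparameters, and function names are hypothetical, and this is a simplified stand-in rather than the neural committee used in this study.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def logits_for(W, b, x):
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(b))]


def fit_meta_learner(features, labels, n_classes, lr=0.5, epochs=300):
    """Fit softmax regression on stacked base-model predictions."""
    n_features = len(features[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(W, b, x))
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                for j in range(n_features):
                    W[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return W, b


def predict(W, b, x):
    scores = logits_for(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Each row: probabilities from model A (3 classes) then model B (3 classes).
features = [
    [0.8, 0.1, 0.1, 0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1, 0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7, 0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1, 0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1, 0.3, 0.6, 0.1],
    [0.1, 0.3, 0.6, 0.2, 0.2, 0.6],
]
labels = [0, 1, 2, 0, 1, 2]

W, b = fit_meta_learner(features, labels, n_classes=3)
print([predict(W, b, x) for x in features])
```

In practice the meta-learner should be fitted on predictions for samples the base models were not trained on, otherwise it learns to trust over-confident base models.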
Each network is considered to be a function f_i(x) that approximates the true function T(x) for a given input x, for i = 1, 2, 3, ..., n, where n represents the number of neural networks in the ensemble. The difference r_i(x) between f_i(x) and T(x) is the generalization error of the i-th network. The average individual error of the networks is the mean of their expected squared errors, and the error of the stacked ensemble is the expected squared error of the combined prediction. If the errors of the individual networks are uncorrelated, the ensemble error is reduced by a factor of n relative to the average generalization error attained by the individual networks. However, in most scenarios the errors are correlated, which increases the generalization error to a certain extent.
To understand this scenario, the case in which the cross terms r_ij between the errors of networks i and j are non-zero is considered. The ensemble error then contains a constant 'ε', an additional error caused by the covariances underlying the perception of the individual networks.
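The argument above follows the standard committee-of-networks derivation, which can be sketched as below. The symbols T, f_i, r_i, n, and ε are those used in the text; the names E_avg and E_ens, the definition r_{ij} = E[r_i r_j], and the simple averaging combiner are assumptions made for this sketch, and the equation numbering of the original derivation is not reproduced.

```latex
\begin{align}
f_i(x) &= T(x) + r_i(x), \qquad i = 1, 2, \ldots, n \\
E_{\mathrm{avg}} &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\!\left[ r_i(x)^2 \right] \\
f_{\mathrm{ens}}(x) &= \frac{1}{n} \sum_{i=1}^{n} f_i(x) \\
E_{\mathrm{ens}} &= \mathbb{E}\!\left[ \left( f_{\mathrm{ens}}(x) - T(x) \right)^2 \right]
  = \mathbb{E}\!\left[ \left( \frac{1}{n} \sum_{i=1}^{n} r_i(x) \right)^{2} \right] \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{E}\!\left[ r_i(x)^2 \right]
  + \frac{1}{n^2} \sum_{i \neq j} r_{ij},
  \qquad r_{ij} = \mathbb{E}\!\left[ r_i(x)\, r_j(x) \right] \\
r_{ij} = 0 \;\; (i \neq j) \quad &\Rightarrow \quad E_{\mathrm{ens}} = \frac{1}{n} E_{\mathrm{avg}} \\
r_{ij} \neq 0 \quad &\Rightarrow \quad E_{\mathrm{ens}} = \frac{1}{n} E_{\mathrm{avg}} + \varepsilon,
  \qquad \varepsilon = \frac{1}{n^2} \sum_{i \neq j} r_{ij}
\end{align}
```

Since the squared mean never exceeds the mean of squares, E_ens is bounded above by E_avg in this setting, so the averaged committee is never worse than the average individual network, even when the errors are correlated.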
With this knowledge, it is clear that stacked ensembles can outperform single networks in terms of generalization. As a result, six different neural network committees were formed by varying the number of neural networks, as described in Tab. 3, which ranged from 2 to 13 networks (all). These ensemble networks were evaluated using the standard classification metrics. A small neural architecture with a fully connected layer was attached to the committee of networks. This fully connected layer consists of 16 neurons with a dropout of 30% for regularization. The final activations were passed through a softmax non-linearity over three neurons, one for each class prediction. The results obtained by the proposed stacked ensembles are described in detail in the next section. The largest committee stacks all the models in Tab. 1.

Results
As discussed above, the generalization error obtained by a committee of neural networks is expected to be lower than that of a single neural network. Six variant committees of networks were selected and combined as described in Tab. 3. The classwise classification metrics used to understand the behavior of the COVID-19 class are given in Tab. 4.
A comparative study was then performed to compare the performance of the proposed network with other existing models described in the literature (Tab. 5).
Our generic training algorithm facilitates the training process by achieving faster convergence with fewer computations (iterations). During the training process, the batch size and learning rate are increased cautiously for each iteration to obtain a balanced criterion, as explained by Smith et al. [41]. As noise during training can be reduced by properly choosing the batching parameters fed into the network, the learning rate and momentum of the optimizer were chosen to allow a faster search. The noise due to training is theoretically represented in Eq. (18).
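For reference, the noise scale of stochastic gradient descent with momentum, as formulated in the line of work by Smith et al., is commonly written as below. The symbols g (noise scale), ε (learning rate), N (training set size), B (batch size), and m (momentum coefficient) are introduced here for illustration, and whether this matches the exact form of Eq. (18) is an assumption.

```latex
\begin{equation}
g \;\approx\; \frac{\varepsilon \, N}{B \, (1 - m)}, \qquad B \ll N
\end{equation}
```

With m held fixed, the noise scale depends only on the ratio of the learning rate to the batch size, so decaying the learning rate and increasing the batch size are two ways of reducing the same quantity.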
Here we assumed a constant momentum. A training algorithm was developed around this noise constraint. Although a decaying learning rate can decrease the noise, it gradually increases the computational time for training. On the other hand, lowering the batch size can also reduce noise, but at the cost of lowering the generalization capacity of the model. These problems were overcome by developing an algorithm that increases the batch size at specified iterations while cautiously increasing the learning rate, as follows. The algorithm was first run for 16278 steps (iteration 1), whereby the learning rate was set to 10^-4 and 15 samples were sent as a batch at a time. In the next iteration (iteration 2), the batch size was increased by 50%, and the learning rate was increased tenfold. In order to maintain a consistent trade-off between generalization and faster convergence, the batch size was then increased by a factor of 150% (relative to the initial batch size), and the learning rate was tuned as per the preceding iteration. During the experimentation, it was found that the proposed training procedure led to faster convergence, requiring only a few steps (approximately 20 epochs).
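The effect of a given schedule on the training noise can be checked numerically. The sketch below evaluates the approximate noise scale, learning rate × N / (B × (1 − m)), for the first two phases described above. The training set size of 2179 (about 75% of 2905 images), the momentum of 0.9, and the rounding of the second batch size to 22 are assumptions; the third phase is left out because its exact settings are not fully specified in the text.

```python
def noise_scale(learning_rate, n_train, batch_size, momentum=0.9):
    """Approximate SGD noise scale: lr * N / (B * (1 - m))."""
    return learning_rate * n_train / (batch_size * (1.0 - momentum))


N_TRAIN = 2179  # assumption: roughly 75% of the 2905 images

# (learning rate, batch size) for the two phases stated in the text.
phases = [
    (1e-4, 15),  # iteration 1
    (1e-3, 22),  # iteration 2: lr x10, batch +50% (22.5 rounded down, assumed)
]

scales = [noise_scale(lr, N_TRAIN, batch) for lr, batch in phases]
for number, (phase, scale) in enumerate(zip(phases, scales), start=1):
    print(f"iteration {number}: lr={phase[0]:g}, batch={phase[1]}, noise={scale:.4f}")
```

A helper of this kind makes it straightforward to compare candidate schedules before training, since only the ratio of learning rate to batch size matters when the momentum is constant.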
Appropriate training with finetuning of the ensemble was therefore critical to obtain these outcomes. The final Ensemble-6 model had the highest performance when compared with the other ensembles, with an accuracy score of 99.175%. Ensemble-1 and Ensemble-2 attained accuracies of 98.487% and 98.762%, respectively. When taking into consideration only the COVID-19 class, the precision rate was at least 97.674%, but the recall rate was lower. The highest and lowest recall rates were 89.795% and 69.387%, obtained by Ensemble-6 and Ensemble-4, respectively. However, due to the small sample of the COVID-19 class in our study, it was difficult to extract additional invariant features to further improve the performance of the model.
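The gap between precision and recall for the COVID-19 class follows directly from how the two metrics are defined. The sketch below uses hypothetical confusion counts, chosen only so that the recall is close to the highest value reported above; they are not the study's confusion matrix.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for the COVID-19 class: 44 cases found, 5 missed,
# and 1 non-COVID image wrongly flagged as COVID-19.
precision, recall = precision_recall(tp=44, fp=1, fn=5)
print(f"precision={precision:.3%}, recall={recall:.3%}")
```

With a small positive class, each missed case moves the recall by about two percentage points in this example, which illustrates why recall is the more fragile metric when few COVID-19 samples are available.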

Limitations
In this study, we observed that the stacked ensemble was slightly less efficient when a poorly performing model was included. The DenseNet-201 model was not always finetuned correctly, and its network depth was not always appropriate, leading to a high generalization error. Based on the features derived from the individual models, the COVID-19 findings were not always captured by a single model. The ensemble method offers more generalization, but the combination of multiple models (as in Ensemble-6) increases the computational cost, which makes it impractical for small-scale computational systems. As a result, in real-world scenarios, small, quick, and efficient models such as Ensemble-1 and Ensemble-2 are advantageous. The progression of the virus can be visualized better on chest CT axial images. However, there is a chance of missing disease progression on CXR [38], which could be dangerous. Therefore, future studies should focus on the development of models that can predict disease progression on CXR.

Conclusion
In this study, various COVID-19 classification models were evaluated and compared using different classification metrics. Furthermore, a learning framework for finetuning these models was proposed, and their bottleneck activations were visualized using CAMs. The AUC-ROC curves were closely examined, and the output of each class was illustrated visually. These finetuned models were then stacked to outperform previous models and to cover a broad range of generalizations. The ensemble models achieved an accuracy score of 97.66% in the worst-case scenario. Even after finetuning for class imbalance, the models were found to have a high generalization ability. The lowest error rate, obtained by the best-performing model, which was built by stacking all the finetuned models, was 0.83%. The stacked ensembles method improved the performance of the model and could therefore be used to improve the prediction accuracy of diagnostic models in medical imaging.