COVID-19 Cases Prediction in Saudi Arabia Using Tree-Based Ensemble Models

The COVID-19 pandemic has affected more than 144 million people and spread to over 200 countries. Predicting the behaviour and trend of COVID-19 is crucial to preventing its spread. The Kingdom of Saudi Arabia (KSA) is Asia’s fifth largest country, and it hosts the two holiest cities of the Islamic world. KSA hosts millions of pilgrims every year, so predicting the spread of COVID-19 is of great importance for organizing these religious activities and returning life to normality in KSA. This study proposes four tree-based ensemble methods to predict the daily new COVID-19 cases in KSA. Tree-based ensemble methods are suggested to reduce the variance and/or bias of unstable models. The four models utilized in the study are Gradient Tree Boosting (GB), Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Voting Regressor (VR). The study is conducted using the “Our World in Data” (OWID) COVID-19 dataset from the first confirmed case in KSA, i.e., 2nd March 2020, to 14th April 2021. The results suggest that the tree-based ensemble models provide a good prediction of daily new COVID-19 cases and can follow the trend of COVID-19. Among the models, XGBoost and VR performed better than the other two models, with XGBoost achieving the best evaluation metric scores (MAE: 4.41, RMSE: 7.11, MAPE: 0.95%). The significant prediction power of the tree-based ensemble methods, especially XGBoost, can provide a platform for policymakers to put in place strategic plans for the closure periods of educational institutions and to organize Hajj and Umrah. © 2022, Tech Science Press. All rights reserved.


Introduction
The SARS-CoV-2 virus, which causes COVID-19, surfaced in Hubei, China in December 2019. By the end of February 2020, the number of COVID-19 cases had risen dramatically to a staggering 80,000. The virus rapidly spread all over the globe. As of April 2021, the number of COVID-19 cases has crossed 144 million worldwide, with over 3 million confirmed deaths. The COVID-19 pandemic impacted virtually every field and industry in the world, including education, finance and travel, among others [1]. Many countries imposed various stringency measures to curb the spread of the virus, which included lockdowns of entire countries and travel restrictions. Prediction can help plan for the future [2]; hence, it is imperative to predict the trend of COVID-19 to set up countermeasures and plan ahead for stringency measures [3].
Epidemic outbreak predictions, like weather forecasts, are subject to fundamental limitations [4]. One of the significant limitations is the relatively short epidemic time series, as it is challenging for governments and stakeholders to carry out medical tests on a large scale. However, researchers around the world came together to study the various impacts of COVID-19 and developed many methods to predict the spread of the virus. The methods ranged from simple sigmoid-curve approaches to complex machine learning and network-based prediction models [3,5,6]. The statistical approaches used to estimate and predict the COVID-19 pandemic include Bayesian approaches [7] and Kalman filtering [8]. Mathematical approaches included parameter estimation on compartmental models such as the Susceptible-Exposed-Infected-Removed (SEIR) model [9,10] or the SIR model [11,12]. Data scientists utilized various machine learning algorithms such as Deep Learning [13], Long Short-Term Memory (LSTM) [6,14], neuro-fuzzy inference [15,16], and decision tree-based algorithms [17]. Among the predictive models, deep learning, SIR, and SEIR are the most used models in COVID-19 prediction [18].
As in the rest of the world, research on COVID-19 was also carried out for the Kingdom of Saudi Arabia (KSA). KSA took bold measures for social distancing regardless of the social, political, economic and especially religious challenges [19,20]. Even with strict and partial lockdowns adopted, researchers suggest that stricter lockdowns would help curb COVID-19 in KSA [21]. Researchers have also studied the impact of COVID-19 and the accompanying social restrictions on mental health in KSA [22,23]. It was concluded that the COVID-19 pandemic in KSA has substantially affected the quality of life and both the psychological and physical health of the population [24]. Current research on the prediction of COVID-19 in KSA includes deep learning methods like ANN, RNN and LSTM [25][26][27], the SEIR model [21,28], the SIR model [29], Singular Spectrum Analysis [30,31] and the Generalized Richards Model [32]. Most of these studies do not focus on the daily new cases, and those that do use very limited datasets [25,31]. This work aims to predict the daily new cases of COVID-19. Our work can help the government and healthcare organizations with real-time decision-making to curb the spread of the COVID-19 epidemic. It can also provide a platform for policymakers to put in place strategic plans for the closure periods of educational institutions and to organize Hajj and Umrah.

Study Location
This study is conducted for the Kingdom of Saudi Arabia (KSA), also known as Saudi Arabia. Fig. 1 shows the location of KSA. Geographically, KSA is the 12th largest country in the world, and it is Asia's fifth largest state [33]. KSA has a population of 34.8 million, with a population density of 15.32 people per square kilometre. KSA has a median population age of 31.9 years, making its populace among the world's youngest, and a life expectancy of 75.13 years [34].
KSA is home to the two holiest cities of the Islamic world, i.e., Mecca and Madinah. Muslims around the world are obliged to make the pilgrimage to Mecca. According to the official reports, 2.48 million people from around the world made the pilgrimage to Mecca for the 2019 Hajj [25]. In 2019, 18.31 million people performed Umrah, and thirty per cent of the Umrah pilgrims were elderly, aged over 50 years [25].

Dataset
This study utilizes the COVID-19 dataset provided by Our World in Data (OWID) [34]. The OWID COVID-19 dataset is updated daily, and it includes data on confirmed cases, deaths and testing [34,35]. Feature engineering (FE) is one of the most essential and significant steps in data science and predictive research [36][37][38]. The feature engineering of the dataset is carried out using Microsoft Excel. The OWID COVID-19 dataset has 59 parameters, but the parameters with constant data are removed during the feature engineering process. The removed parameters include population, human development index, life expectancy, etc. Twenty-five parameters are selected for the current study using the exclusion criteria of constant data, empty data, and redundant data. Tab. 1 presents all the parameters used in the current study. The dataset is a time series dataset with a timestep of one day. The OWID dataset spans from the first confirmed case in KSA, i.e., 2nd March 2020, to 14th April 2021. Parameters 1-24 are our independent parameters, and Parameter 25 (new cases) is our dependent parameter.
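The study performs this filtering in Microsoft Excel. As an illustration only, the constant-data and empty-data exclusion criteria could be sketched in Python as below. The function name `drop_uninformative_columns` and all row values are hypothetical, and the redundant-data criterion is not covered because the text does not define it.

```python
def drop_uninformative_columns(rows):
    """Keep only columns with more than one distinct non-missing value.

    rows: list of dicts sharing the same keys; None marks a missing value.
    A column with 0 distinct values is empty, with 1 distinct value constant.
    """
    kept = []
    for name in rows[0]:
        observed = {row[name] for row in rows if row[name] is not None}
        if len(observed) > 1:
            kept.append(name)
    return [{name: row[name] for name in kept} for row in rows]


# Illustrative rows only: the values are made up, not taken from the dataset.
rows = [
    {"date": "2020-03-02", "new_cases": 1, "stringency_index": 20.0,
     "population": 34_800_000, "icu_patients": None},
    {"date": "2020-03-03", "new_cases": 0, "stringency_index": 20.0,
     "population": 34_800_000, "icu_patients": None},
    {"date": "2020-03-04", "new_cases": 2, "stringency_index": 35.0,
     "population": 34_800_000, "icu_patients": None},
]

cleaned = drop_uninformative_columns(rows)
kept_columns = list(cleaned[0])
print(kept_columns)
```

In this toy input, `population` is dropped as constant and `icu_patients` as empty, mirroring two of the three exclusion criteria named above.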

Methods
Some of the popular machine learning algorithms, such as Artificial Neural Networks and Decision Trees, are considered inherently unstable. This is because these algorithms lead to significantly different predictions if there is any perturbation of the training dataset [39]. These predictor algorithms have high variance and low bias. Tree-based ensemble methods are suggested to reduce the variance and/or bias. In these methods, an ensemble of various base predictor models is created and joined together to form a single predictor, the ensemble model [40]. Ensemble methods are used in various research fields including big data [41][42][43], clustering [44], keyword extraction [45], text classification [46][47][48][49][50], prediction [51], and sentiment analysis [52][53][54]. In this study, we use four tree-based ensemble models, i.e., Gradient Tree Boosting (GB), Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Voting Regressor (VR).

Gradient Tree Boosting
Gradient Boosted Decision Trees or Gradient Tree Boosting (GB) is a decision tree-based ensemble method and is considered one of the most versatile and effective techniques for building predictive models [55]. GB generalizes boosting to arbitrary differentiable loss functions and is considered an effective and accurate method suitable for both classification and regression problems. Fig. 2 presents the working of GB. Multiple regression trees are chained together sequentially, so each tree is trained on the residuals left by the trees before it, and at every step a new learner is added to reduce the loss function as much as possible. An additive model is used to combine these trees, creating a stronger tree-based ensemble model.
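The residual-fitting loop described above can be illustrated with a minimal sketch. It assumes squared-error loss, a single input feature, and one-split trees (stumps) as base learners. The helper names, the toy data and the hyperparameter values are all hypothetical and are not the configuration used in the study.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression tree on a single feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # exclude max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    """Additive model: constant base prediction plus shrunken stumps."""
    base = mean(ys)
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


# Toy data (made up): one feature, step-like target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 12, 15, 30, 32, 35, 60, 62]

model = fit_gradient_boosting(xs, ys)
train_mse = mean((y - model(x)) ** 2 for x, y in zip(xs, ys))
baseline_mse = mean((y - mean(ys)) ** 2 for y in ys)
print(train_mse < baseline_mse)
```

Each added stump lowers the training squared error relative to the constant baseline, which is the sense in which the additive model becomes stronger than any single tree.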

Extreme Gradient Boosting
Extreme Gradient Boosting, commonly known as XGBoost, is an implementation of GB, which is designed to prevent overfitting and enhance performance and speed [56]. XGBoost was designed to be a scalable end-to-end method and adapt to the available resources to make the best use of them during the training phase. XGBoost is used by data scientists in many machine learning challenges to obtain state-of-the-art results [55].
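For reference, the overfitting control mentioned above is usually expressed through the regularized objective from the XGBoost literature, reproduced here as a sketch. The symbols are not defined elsewhere in this paper: $l$ is a differentiable loss, $f_k$ the $k$-th tree, $T$ its number of leaves, $w$ its vector of leaf weights, and $\gamma$, $\lambda$ regularization hyperparameters.

```latex
\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}
```

The penalty term $\Omega$ discourages trees with many leaves or large leaf weights, which is one way the method limits overfitting compared with plain gradient boosting.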

Random Forest
Random decision forests or Random Forests (RF) are widely used decision tree-based learning algorithms. RF is used for regression and classification problems in machine learning. Leo Breiman developed the algorithm in 2001 [57]. He proposed a method of building a forest of uncorrelated trees using a procedure similar to classification and regression trees, combined with bagging and randomized node optimization. Fig. 3 presents the working of RF: multiple trees are trained on slightly differing training data and are combined into a stronger model, whose prediction by committee is more precise than that of any individual decision tree in the RF.
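A heavily simplified sketch of the bagging idea is shown below. Each "tree" is a single split on one randomly chosen feature, trained on a bootstrap sample, and the forest averages the trees. A real Random Forest grows much deeper trees and samples candidate features at every node, so this is an illustration under those stated simplifications. All names and data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Best single split on one feature, minimizing squared error."""
    values = sorted({row[feature] for row in rows})
    best = None
    for threshold in values[:-1]:
        left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
        right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # the sample held a single distinct value: predict its mean
        constant = mean(targets)
        return lambda row: constant
    _, threshold, left_mean, right_mean = best
    return lambda row: left_mean if row[feature] <= threshold else right_mean


def fit_random_forest(rows, targets, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (bagging)
        feature = rng.randrange(n_features)            # random feature choice
        trees.append(fit_stump([rows[i] for i in sample],
                               [targets[i] for i in sample],
                               feature))
    # Prediction by committee: average over all trees.
    return lambda row: mean(tree(row) for tree in trees)


# Toy data (made up): two features per day, target is the daily count.
rows = [(1, 20.0), (2, 20.0), (3, 35.0), (4, 35.0), (5, 50.0), (6, 50.0)]
targets = [5, 8, 20, 24, 40, 45]

forest = fit_random_forest(rows, targets)
forest_prediction = forest((3, 35.0))
print(forest_prediction)
```

Because every tree predicts a mean of training targets, the averaged forest prediction always stays within the range of the observed targets.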

Voting Regressor
The concept behind the Voting Regressor (VR) is simple and intuitive: it combines various machine learning models and returns the average of their predicted values as the final prediction (majority voting plays the same role for classifiers). Fig. 4 shows the general working of the VR. VR is very useful for a set of models that perform equally well, as it helps to balance out their individual weaknesses and predict more accurately.
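The averaging step can be sketched in a few lines. The three base models below are hypothetical stand-ins for fitted regressors; in the study the combined models would be the trained tree-based ensembles.

```python
from statistics import mean


def voting_regressor(models):
    """Return a predictor that averages the predictions of the given models."""
    return lambda x: mean(model(x) for model in models)


# Hypothetical base models that disagree slightly around the same trend.
def model_a(x):
    return 2.0 * x + 1.0


def model_b(x):
    return 2.0 * x - 1.0


def model_c(x):
    return 2.0 * x + 3.0


ensemble = voting_regressor([model_a, model_b, model_c])
prediction = ensemble(10.0)  # (21.0 + 19.0 + 23.0) / 3 = 21.0
print(prediction)
```

The individual offsets partly cancel in the average, which is the balancing effect described above.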

Results and Discussion
The COVID-19 dataset used in this research is a time series dataset with a timestep of 1 day. As presented in Tab. 1, the dataset contains an interesting parameter called stringency index. Stringency index is a composite measure. It is calculated based on nine response indicators, including travel bans and the closing down of schools and workplaces. Fig. 5 presents a comparison between normalized daily new cases and stringency index.
The data presented in Fig. 5 show a mixed yet interesting comparison of the stringency index and daily new cases. They clearly show a sharp increase in daily cases just after the stringency index dropped at the start of June 2020. Furthermore, maintaining the stringency index from June 2020 to August 2020 corresponded with a consistent drop in the daily COVID-19 cases going forward. Fig. 6 presents the Spearman correlation for the stringency index and daily new cases. We use the Spearman correlation as it is considered more robust and appropriate for time series data. The correlation between stringency index and daily new cases is 0.59, which is a positive and significant value. Based on the trend and correlation of stringency index and daily new cases, it can be concluded that the stringency measures put in place by the government of KSA had a positive impact on limiting the daily new cases.
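For readers unfamiliar with the Spearman correlation used above, it is the Pearson correlation computed on the ranks of the two series, with tied values sharing their average rank. The sketch below uses made-up numbers, not the study's data, and the function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))


# Made-up series: monotonically related but far from linear.
stringency = [10, 20, 30, 40, 50]
cases = [1, 4, 9, 16, 100]

rho = spearman(stringency, cases)
rho_reversed = spearman(stringency, cases[::-1])
print(rho, rho_reversed)
```

Because only ranks matter, the outlying value 100 does not reduce the coefficient, which illustrates the robustness argument made above.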

Prediction of Daily New Cases
All four models utilized in this research are tree-based ensemble models. The goal of an ensemble model is to join a number of estimators to improve performance, estimation power and generalizability. The experimental results show that the tree-based ensemble models used in the current study can predict the daily new cases and follow their trend. Fig. 7 presents the daily new cases predictions by the models used in the study. The models are trained on the training dataset (80%, 327 days) and tested on the test dataset (20%, 82 days). The prediction results show the significant prediction power of the tree-based ensemble methods. All tree-based ensemble models performed well, with XGBoost performing the best among them. It is also evident from the results that the tree-based ensemble models can follow the curve of the COVID-19 daily new cases. Tab. 2 presents a comparison of the evaluation metrics. We utilize three evaluation metrics. First, Mean Absolute Error (MAE) is a linear scoring metric that computes the average magnitude of the errors without considering their polarity, i.e., positive or negative. It averages the absolute differences between the real data and the predicted data over the test sample, giving all differences the same weight. As it represents the differences, lower values are considered better for MAE. Second, Root Mean Squared Error (RMSE) is a quadratic scoring metric. Like MAE, RMSE measures the magnitude of the average error, and it is a negatively oriented score, which means a lower value is considered better. RMSE is deemed the key criterion for predictive models. The RMSE results show that the XGBoost model's prediction was more accurate than that of the other models. The other models, i.e., Gradient Boosting, RF and Voting Regressor, also performed relatively well.
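The 327/82 day counts quoted above follow from an 80/20 split of the 409 daily records between 2nd March 2020 and 14th April 2021. The sketch below reproduces that arithmetic. It assumes a chronological split, which the text does not state; a shuffled split would give the same counts.

```python
from datetime import date, timedelta

first_day = date(2020, 3, 2)    # first confirmed case in KSA
last_day = date(2021, 4, 14)    # end of the study period

n_days = (last_day - first_day).days + 1  # inclusive count of daily records
days = [first_day + timedelta(days=i) for i in range(n_days)]

# Assumption: the first 80% of days train the models, the rest test them.
n_train = int(n_days * 0.8)
train_days, test_days = days[:n_train], days[n_train:]

print(n_days, len(train_days), len(test_days))
```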
The MAE values for the tree-based ensemble models are presented in Tab. 2. It is beneficial to compare MAE and RMSE in predictive models where sizable errors are unwanted, as RMSE gives more weight to large errors. The comparison of RMSE and MAE shows that the models used in the study do not have large residual errors, and the XGBoost model performed very consistently, with a very small difference between MAE and RMSE.
The third performance evaluation metric used in the study is the Mean Absolute Percentage Error (MAPE). MAPE shows the accuracy in the form of an error percentage. As it is expressed as a percentage, it is easier to interpret than the other evaluation metrics. The MAPE values presented in Tab. 2 show that the tree-based ensemble methods perform very well, with XGBoost performing the best with a MAPE score of 0.95%, leading to the conclusion that the tree-based ensemble models, especially XGBoost and VR, can predict the COVID-19 daily new cases accurately.
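The three metrics discussed above can be written out as a short sketch using their standard definitions. The sample values are made up for illustration and are unrelated to Tab. 2. Note that MAPE is undefined when an actual value is zero, a case this sketch does not handle.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: squaring gives large errors more weight."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up example: three small errors and one large one.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 360]

mae_value = mae(actual, predicted)    # (10 + 10 + 10 + 40) / 4 = 17.5
rmse_value = rmse(actual, predicted)  # sqrt(1900 / 4) = sqrt(475)
mape_value = mape(actual, predicted)
print(mae_value, rmse_value, mape_value)
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, as with the single error of 40 here. A small gap therefore indicates the absence of large residual errors.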
After analyzing the three performance evaluation metrics, i.e., RMSE, MAE and MAPE, the XGBoost algorithm emerges as the most efficient and accurate of the tree-based ensemble models utilized in the study. It should be mentioned that the other three tree-based ensemble models also performed well. The limitations of the study include the use of only tree-based ensemble models and a limited COVID-19 dataset, as the pandemic is still affecting the world. In future work, we would like to compare the tree-based ensemble models with predictive and time series models such as Neural Networks [58] and LSTMs [59].

Recommendations for Decision-Makers
After examining the prediction results and evaluation metrics, it can be concluded that the tree-based ensemble methods can be used to predict the trend of COVID-19 daily new cases in KSA. An encouraging aspect of this research is that we have used different tree-based ensemble models, and all of the models were able to predict the COVID-19 daily new cases relatively well. We recommend that decision-makers utilize the tree-based ensemble models to study and predict the daily new cases of COVID-19. The COVID-19 pandemic has also spotlighted the problem of easily spreadable misinformation regarding the disease and the pandemic [60,61]. Different researchers have emphasized the need to detect this misinformation and highlighted the issues of rapidly changing information and of available datasets being limited to the English language [62,63]. The researchers also highlighted the need for adapted and novel natural language processing techniques to tackle misinformation in the COVID-19 pandemic, especially for languages other than English.
The trend and correlation between the stringency index and the daily new cases clearly showed that the stringent measures taken by the government of KSA were influential in decreasing the COVID-19 daily new cases. Based on the correlation and trend, we recommend combining the tree-based ensemble model predictions with the stringency index. In this way, decision-makers can devise an adaptable strategy to reduce the spread of COVID-19 in KSA, put in place strategic plans for the closure periods of educational institutions, and organize Hajj and Umrah.

Conclusion
This research utilized four tree-based ensemble methods to predict the COVID-19 daily new cases in KSA. The four models utilized in the study are Gradient Tree Boosting (GB), Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Voting Regressor (VR). The OWID COVID-19 dataset was used to train the models. The OWID dataset spans from the first confirmed case in KSA, i.e., 2nd March 2020, to 14th April 2021. All tree-based ensemble models trained for predicting the daily new cases performed well, with XGBoost providing the best scores of MAE (4.41), RMSE (7.11), and MAPE (0.95%). The results show that the tree-based ensemble models, especially XGBoost, can be used to predict the COVID-19 daily new cases accurately. Furthermore, the analysis of the stringency index and daily new cases shows that the stringency measures put in place by the government of KSA had a positive impact on limiting the daily new cases. The results of the current study can help stakeholders put forward strategic plans to control the spread of COVID-19, organize the closure periods of educational institutions, and organize Hajj and Umrah.
Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.