An Intrusion Detection System for SDN Using Machine Learning

Software Defined Networking (SDN) has emerged as a promising and exciting option for the future growth of the internet. SDN has increased the flexibility and transparency of the managed, centralized, and controlled network. On the other hand, these advantages create a more vulnerable environment with substantial risks, culminating in network difficulties, system paralysis, online banking frauds, and robberies. These issues have a significant detrimental impact on organizations, enterprises, and even economies. Accuracy, high performance, and realtime systems are necessary to achieve this goal. Using a SDN to extend intelligent machine learning methodologies in an Intrusion Detection System (IDS) has stimulated the interest of numerous research investigators over the last decade. In this paper, a novel HFS-LGBM IDS is proposed for SDN. First, the Hybrid Feature Selection algorithm consisting of two phases is applied to reduce the data dimension and to obtain an optimal feature subset. In the first phase, the Correlation based Feature Selection (CFS) algorithm is used to obtain the feature subset. The optimal feature set is obtained by applying the Random Forest Recursive Feature Elimination (RF-RFE) in the second phase. A LightGBM algorithm is then used to detect and classify different types of attacks. The experimental results based on NSL-KDD dataset show that the proposed system produces outstanding results compared to the existing methods in terms of accuracy, precision, recall and f-measure.


Introduction
Web applications are becoming more widespread, and the internet has become an integral component of our everyday lives. As a result, there has also been increasing attention paid to the issue of network security [1]. The identification of anomalous network behavior is an important issue in network security research. Intrusion Detection System (IDS) are used to examine network data and identify anomalous network activities. IDSs are typically categorized into two types: signature-based and anomaly-based detection systems [2].
Signature-based Intrusion Detection Systems identify intrusion by constructing abnormal behavior character libraries and comparing network data. This approach identifies well-known attacks and has a low false alarm rate. However, this method is incapable of detecting zero-day attacks that are new and unknown. Anomaly-based intrusion detection systems build models based on typical network activity and identify intrusions depending on whether the behaviors deviate from the norm [3].
Software Defined Networking (SDN) is a topology which is dynamic, programmable, inexpensive and flexible, making it a good choice for today's high-bandwidth, dynamic real time applications. The network control and forwarding responsibilities are isolated in this topology, enabling network control to be easily configurable while the underlying infrastructure for applications and network services is hidden. Fig. 1 depicts a detailed description of the SDN architecture. The OpenFlow protocol [4] is a well-known standard that is utilized in the development of SDN systems. SDNs are being used in a variety of network applications, ranging from residential and corporate networks to datacenters. SDN characteristics assist in resolving various security challenges in a conventional network and provide us with the ability to govern network traffic at a fine-grained level. However, the SDN architecture offers additional attack vulnerabilities and security threats.
In SDN, Kreutz et al. [5] presented seven threat vectors. The control-data interface and the controlapplication interface are two that are unique to SDN and are connected to the controller. Controller is insecure because it creates a single point of attack and failure. Encrypting the communication connection with the Transport Layer Security protocol (TLS) is also ineffective and optional. In the SDN architecture, several attacks are possible. SDN security is a serious challenge due to the broad diversity types of SDN deployment. To secure SDNs, different machine learning algorithms has been used.
Machine learning is a field in computer science which originated from the pattern recognition and computational learning theories of artificial intelligence. It deals with the design and development of algorithms for learning and forecasting data. When the training data is labelled, machine learning is classed as supervised learning, unsupervised learning is when the training data is unlabeled, and semi-supervised learning is when the training data is a mix of labelled and unlabeled data [6]. Decision Tree (DT), Random Forest (RF), XGBoost, K-Nearest Neighbor (KNN), K-Mean Clustering, Support Vector Machine (SVM) and Ensemble Methods are some of the most commonly used techniques for IDS.
In order to facilitate the comprehension of the proposed work, the rest of the paper is structured into different sections. Section 2 illustrates the related work for SDN that is based on machine learning algorithms. Section 3 depicts the proposed HFS-LGBM IDS. Section 4 discusses and evaluates the results of the proposed system using various evaluation metrics. Finally, Section 5 summarizes the conclusions and makes recommendations for further research.

Related Works
Machine learning techniques have become an essential component of attack security measures. These machine learning-based technologies are capable of distinguishing between normal and attack traffic with extreme accuracy. There are multiple research projects that provide attack detection and prevention methods against various SDN threats.
Singh et al. [7] suggested a machine learning-based DDoS defense technique in SDN. The developed system is divided into three modules: flow statistics collection, feature extraction, training, and network traffic categorization. In this security system, several machine learning classifiers such as Support Vector Machine, Decision Tree, Logistic Regression and Nearest Neighbors are investigated for identifying normal and attack traffic. For attack detection, the classifier with the highest accuracy and lowest false positive rate is chosen. Chen et al. [8] presented an attack detection method based on the XGBoost classifier. By exploiting the controller's characteristics, the traffic collector and classification model were deployed in the controller. The purpose of the research was to resolve the treat concerns on the controller.
Researchers presented a crossbred IDS method in [9] that combines two different evolutionary algorithms such as Artificial Fish Swarm and Artificial Bee Colony. The crossbred method was used to produce anomaly detection criteria that are based on a limited subset of characteristics generated by the primary combined process. The inspired alternatives also outperform traditional and profound learning algorithms for detecting network threats, but comparisons are more difficult due to a lack of knowledge on methodologies and processes.
In [10], an extreme gradient boosting classifier was employed to differentiate between two types of attacks: normal and DoS. POX SDN, an open source SDN infrastructure for testing and developing SDNbased approaches, was used as a controller to analyse and verify the detection strategy. To improve computations by generating structure trees, the XGBoost term was added and integrated with the logistic regression technique. Two normalization techniques such as logarithmic and min-max were used. For an intrusion detection application, the SVM Classifier [11] was combined with the principal component analysis (PCA) technique. In this approach, the NSL-KDD dataset is employed to train and optimize the model for detecting anomalous patterns. To tackle the diversity data scale ranges with the fewest misclassification concerns, a Min-Max normalization approach was developed.
The researchers in [11] have proposed a framework that relies on voting based ensemble model for the attack detection. Ensemble model is a combination of multiple machine learning classifiers for prediction of final results. In the proposed work, three ensemble models such as Voting-CMN, voting-RKM and voting -CKM are proposed and analysed. The authors in [12] integrated the benefits of machine learning with intrusion detection systems to provide high detection rate and to defend networks from attackers. To detect attack anomalies, the presented system employs a Grid Search approach combined with SVM.
Their experimental research shows that they have made significant progress in identifying practically all conceivable network attacks in an SDN-based cloud environment.
In [13], new features are incorporated to build an ML-based model capable of detecting a DDoS attack on the SDN controller. In addition to traffic data, the additional characteristics are retrieved from packet headers. The experimental findings suggest that the proposed work can identify the attack rapidly. Sultana et al. [14] conducted a review on several current researches on machine learning algorithms that use SDN to construct NIDS. The study also included tools for developing NIDS models in an SDN environment.
The fundamental task of an intrusion detection system is to analyse huge amounts of network traffic data. To solve this problem, a well-organized classification system is required. Support Vector Machine (SVM) and Naive Bayes are two machine learning algorithms used in the proposed system [15] for resolving categorization issues. To increase the efficiency of identifying known and unexpected attacks, the research [16] develops a multi-level hybrid intrusion detection model that employs support vector machines and extreme learning machines.
A modified K-means technique is also proposed for creating a high-quality training dataset, which helps classifiers perform better. The modified K-means method is used to create new tiny training datasets that represent the full original training dataset, reducing classifier training time and improving intrusion detection system performance [17]. provides a complete analysis of machine learning-based intrusion detection algorithms that have been published in the literature. This study complements existing intrusion detection surveys and serves as a reference for researchers working on machine learning-based intrusion detection systems.

Proposed Method
This section discusses the proposed IDS to detect the attack in SDN. A high-level view of the deployment of HFS-LGBM based IDS in SDN architecture is presented in Fig. 2. The OpenFlow switches on the controller unit are normally managed by the SDN controller. The SDN controller has the ability to request all network data whenever it is needed. As a result, the proposed HFS-LGBM intrusion detection system is deployed in the SDN Controller.
The architectural diagram of the HFS-LGBM system is presented in Fig. 3. In this paper, the HFS algorithm that combine two feature selection algorithms such Correlation based Feature Selection and Random Forest Recursive Feature Elimination is proposed.

Data Preprocessing Phase
Data preprocessing is the most crucial task which simplifies and improves the efficiency of data mining methods. Data is usually derived from a variety of sources and can be noisy, excessive, incomplete or contradictory. As a result, it is necessary to transform unprocessed data into useful information for investigation and disclosure. The pre-preprocessing phases in this research include the following processes, which are detailed in the following sections:

Data Cleaning
To achieve accurate prediction, data cleaning is the process of eliminating or correcting inaccurate, duplicate, or incomplete records and filling missing values within provided datasets. To avoid deceiving the models during training, the white or blank spaces of multi-class labels are identified and removed.

Label Encoding
Several datasets have categorical variables, that are incompatible with most machine learning algorithms. As a result, it is necessary to convert these categorical into numerical values. Each categorical value is assigned an integer value using one-hot and ordinal encoding techniques.

Normalization
The effectiveness of classification models can be significantly influenced by feature imbalance scales. As a result, it's critical to standardize these discrepancies within the dataset's features such that both negligible and dominating values are within an appropriate limits. The minimum-maximum technique is used to systematically normalize dataset features within the normalized range of [0, 1], making the data easier to interpret. The minimum-maximum method's equation is as follows: where X' denotes the normalized result, X min and X max denotes the minimum and maximum value of feature X and the original data sample value to be normalized is represented by X.

Feature Selection Phase
The proposed HFS approach is described in this section. Fig. 4 depicts the flow diagram for the HFS approach.
The proposed HFS Approach is composed of two stages. The CFS is used to minimize the size of the feature set in the first stage by eliminating noisy irrelevant and redundant features [18]. The RF-RFE is used in the second stage to find the optimal feature subset from the reserved feature set.

Correlation Based Feature Selector (CFS)
One of the ways in machine learning is to select features for predicting outcomes based on their connection, and such a feature selection methodology may be favorable to regular machine learning algorithms. It is advantageous if a feature conforms to or anticipates a class. A distinguishing feature (X i ) is perceived to be relevant if and only if some probability (P)x i and y exist such that P(X i = xi) > 0, as shown in Eq. (2).
Superfluous features must be removed in addition to insignificant features, according to experimental evidence from the feature selection literature. If a feature is overly intertwined with one or more other features, it is considered superfluous. Furthermore, this yielded a hypothesis for feature selection, which is a usable, acceptable characteristic feature subgroup consisting of features that are strongly connected with class but dissociated from one another. If the association among an individual feature and an extrinsic variable is renowned in a provided feature set, and the inter-relationship among each other pair of features is known, then Pearson's correlation coefficient Eq. (3) can be used to calculate the relationship between the complicated test consisting of the total features and the extrinsic variable.
The observed and average values of the features examined are denoted by x i and x, respectively.
The dataset class's observed and average values are defined by y i and y.
If a set of n features is picked, the correlation coefficient would be used to investigate the link between the set and the class while accounting for feature inter-correlation. As the link between features and classes expands, the feature group becomes more important. Furthermore, it lowers as inter-correlation increases. The aggregated correlation coefficient across features and output variables is defined as r ny = p(X n ,Y), and the aggregate among varied features is defined as r nn = p (X n ,X n ). Eq. (4) gives the group correlation coefficient for determining the significance of the feature subset.
J ðX n; Y Þ ¼ nr ny ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi n þ ðn À 1Þr nn p In the correlation-based feature selection technique, the Pearson's correlation coefficient is used to allow the inclusion or deletion of one feature at a time.

Random Forest Recursive Feature Elimination (RF-RFE)
RFE is a greedy algorithm-based feature-ranking approach. In order to achieve the most significant features, RFE is used to eliminate the least significant features from the entire feature one by one in each iteration. The recursion is mandatory because significant characteristic features for some processes might change dramatically while evaluating beyond an alternate subset of features during step-wise elimination. This mainly pertains to features which are closely linked. The final feature set is built in the order in which the features are rejected. Simply extracting the first n features from this ranking is the feature selection strategy.
Random Forest is a prediction approach that utilizes an ensemble of models. It constructs a composite predictor by integrating a large number of independent prediction trees. To construct the final forecast of a dataset, an absolute rule is employed among the predictors' options. Furthermore, in order to generate unrelated and diverse insights, each tree is built using a subset of the preparation set's dataset. Additionally, the algorithm incorporates random contingency in the search for optimal splits in order to optimize the dissimilarity among the trees. The proportion of relevance of distinctive features is linked with the recursive feature elimination process in the RF-RFE technique. The RF-RFE approach is based on the notion of frequently generating a random forest model and picking the appropriate or worst operational feature. Eliminate the feature and repeat the process again with the remaining features. This process is repeated until all of the features in the dataset have been utilized. After that, the features are ranked in the order they are deleted. , a greedy optimization technique is used to find the best performing feature subset.

Training Phase
The LightGBM classification algorithm is used to train the data selected by the HFS algorithm. LightGBM is a high-performance gradient boosting machine learning technique that is fast, distributed, and open-source. It employs histogram-based methods to improve training while consuming less memory. LightGBM is very efficient and precise, enables parallel learning, and works well with huge datasets. LightGBM is primarily comprised of two enhanced algorithms: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). LightGBM employs an enhanced histogram technique: the EFB algorithm reduces the number of features, while the GOSS algorithm reduces the number of samples each round of training. Various data instances in GOSS play different roles in calculating information gain, with a higher gradient instance contributing more to the information gain. To preserve the precision of information gain estimates, GOSS retains instances with high gradients and removes instances with low gradients at random [19]. Eq. (5) depicts the mathematical analysis in GOSS.
where b V j ðdÞ: Estimated variance gain over the subset A∪B. To identify the split point, the estimated b V j ðdÞ is employed over a smaller instance subset rather than the accurate V j (d)across all the instances. Simultaneously, LightGBM employs the EFB approach to reduce model complexity by combining exclusive features into a single feature. The logistic regression obtained in Eq. (6) is the loss function [19] in our model. This function is found to be an excellent calibration statistic function for training our detector since it penalizes the difference between real and anticipated odds. Furthermore, it calculates the relative uncertainty between the classes predicted by our method and the actual classes.
where i represent the given observation, y i represent the true value andŷ represent the probability of prediction.

Classification Phase
Furthermore, based on the clusters produced during the training phase, the final trained model in the classification phase with the average of probability rule and voting technique is applied to categorize the test dataset as benign or various attack types inside the test dataset.

Mininet Implementation
This paper host VMware's Mininet Virtual Machine. Mininet is a Python-based open source network emulator that generates a virtual networking architecture that connects virtual hosts through various devices such as switches, links, and controllers. It comes with Linux network software and is capable of supporting OpenFlow for custom routing and SDN. Because mininet must be installed on a Linux server, we picked Oracle VM VirtualBox for our simulations. The simulation was run on a PC running 64-bit Ubuntu 18.04 LTS on a Core-i7 with 16 GB of RAM.

Dataset Description
The authors of [20] presented an improved version of KDDCup'99, termed NSL-KDD, to address the previously noted issues [1]. Despite the NSL-KDD dataset's severe intrinsic challenges, such as the insufficient representation of contemporary low footprint attack scenarios, it is still regarded the most recommended IDSs assessment dataset due to its unique feature of maximizing predictions for classifiers. It is made up of four attack categories with 41 attributes each and a single labeled class that distinguishes between malicious and normal network traffic. The dataset is divided into training and testing dataset. The training set consists of 1, 25, 973 records and the testing set consists of 22, 544 records.

Experimental Results of HFS-LGBM Intrusion Detection System
The experiments were carried out using the WEKA environment and the NSL-KDD dataset. The proposed HSF-LGBM IDS is evaluated using traditional metrics such as precision, recall, accuracy and F1-score.
The proposed method is compared with the existing machine learning classifiers. Support Vector Machine (SVM), Linear Regression (LR), Random Forest, Catboost, Xgboost, LightGBM, and the proposed HFS-LGBM are the machine learning classification algorithms compared in Tab. 1. The comparison is pictorially presented in Fig. 5. From the analysis, it is clear that the proposed HFS-LGBM performs better in terms of accuracy, precision, recall and F-score.
The optimal selected features of NSL-KDD dataset in presented in Tab. 2. In this paper, the proposed HFS-LGBM method is compared with the other feature selection algorithms such as chi, information gain, correlation based feature selection. The results are shown in Tab. 3 and Figs. 6-9. The proposed algorithm has achieved better performance in terms of accuracy, precision, recall and F-measure.

Conclusion
SDNs enhance networking flexibility by separating the control plane and data plane and removing network architectural privacy while opening and scheduling networks. SDN, as a dynamic network, threatens the technological future. In this paper, a novel HFS-LGBM IDS is proposed. The HFS approach combines the advantages of two algorithms such as correlation based feature selection and Random Forest Recursive Feature Elimination. Mininet is used to construct a static software network. To evaluate the proposed system, NSL-KDD dataset is used. The evaluation results of the proposed system are compared with various feature selection and machine learning classification algorithms. According the   Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.