Optimal Deep Belief Network Enabled Cybersecurity Phishing Email Classification

Recent developments in Internet and cloud technologies have led to a considerable rise in the use of online media in day-to-day life. This growth has also resulted in illegal access to users’ private data and its compromise. Phishing is a popular attack that tricks the user into accessing malicious content so that the attacker can obtain the user's data. Proper identification of phishing emails is therefore an essential task in the domain of cybersecurity. This article focuses on the design of a biogeography based optimization with deep learning for phishing email detection and classification (BBODL-PEDC) model. The main aim of the BBODL-PEDC model is to distinguish legitimate emails from phishing emails. The BBODL-PEDC model initially performs data pre-processing in three stages, namely email cleaning, tokenization, and stop word elimination. A TF-IDF model is then applied to extract useful feature vectors. Next, an optimal deep belief network (DBN) model is used for email classification, and its efficacy is boosted by a BBO based hyperparameter tuning process. The performance of the BBODL-PEDC model is validated on a benchmark dataset, and the results are assessed under several measures. Extensive comparative studies report the superior outcomes of the BBODL-PEDC model over recent approaches.


Introduction
With the rapid development of communication and global networking techniques, many day-to-day activities such as electronic banking, e-commerce, and social networking have moved to cyberspace [1]. The uncontrolled, open, and anonymous structure of the Internet provides an ideal environment for cyberattacks, which pose severe security vulnerabilities for ordinary and experienced computer users alike, as well as for networks. The process of defending cyberspace from attack is called Cyber Security [2]. Cyber Security covers protecting every resource that uses the Internet from cyberattack, preventing such attacks, and recovering from them [3]. The difficulty of the cybersecurity field rises day by day, which makes controlling, identifying, and analyzing the relevant risk events important problems. A cyberattack is a malicious digital attempt to intrude on, steal, or damage confidential organizational or personal information [4]. Even though the experience and carefulness of the user are significant, it is impossible to entirely prevent users from falling for a phishing scam [5]. A phishing attack is a type of social engineering attack widely employed for stealing user data, including credit card numbers and login credentials. It happens when an attacker, masquerading as a trusted individual, dupes a victim into opening an email, text message, or instant message. Fig. 1 illustrates the process of the phishing classification model.
To obtain personal information, criminals create illegitimate replicas of real emails and websites, generally those of an organization or financial institution that handles financial information [6]. Such emails are rendered with authentic-looking slogans and company logos. The structure and design of hypertext markup language (HTML) allow an entire website or an image to be copied [7]. This same property is a major factor in the rapid expansion of the Internet as a communication medium, and it also allows the misuse of trademarks, brands, and company identifiers that customers rely on as a validation mechanism [8]. To trap users, phishers send "spoofed" emails to as many people as possible. Once such an email is opened, the customer tends to be diverted from the authentic entity to a spoofed website, and there is a significant risk that the user's data will be exploited. For that reason, phishing is a critical, urgent, and challenging problem in current society [9].
Numerous studies address phishing from different aspects of the domain, such as the website content, the website uniform resource locator (URL), the combination of URL and content, the screenshot of the website, and the source code of the website [10]. However, there is a lack of effective anti-phishing tools that let an institution identify malicious URLs in order to protect its users. If malicious code is embedded in a website, attackers may install malware and steal user information, which poses a severe threat to user privacy and cybersecurity. Malicious URLs on the Internet can be identified by examining them with machine learning (ML) approaches.
This article focuses on the design of a biogeography based optimization with deep learning for phishing email detection and classification (BBODL-PEDC) model. The BBODL-PEDC model initially performs data pre-processing in three stages, namely email cleaning, tokenization, and stop word elimination. In addition, a term frequency-inverse document frequency (TF-IDF) model is applied to extract useful feature vectors. Finally, an optimal deep belief network (DBN) model is used for email classification, and its efficacy is boosted by the BBO based hyperparameter tuning process.

Related Works
Saha et al. [11] introduced a data-driven framework for detecting phishing webpages using a deep learning (DL) technique. In particular, a multilayer perceptron (MLP), also referred to as a feed forward neural network (FFNN), was used to predict phishing webpages. The dataset was collected from Kaggle and comprises data on ten thousand webpages. Opara et al. [12] presented HTMLPhish, a DL based, data-driven, end-to-end automatic phishing webpage classification method. Specifically, HTMLPhish takes the content of the HTML document of a webpage and uses a convolutional neural network (CNN) to learn the semantic dependencies in the textual content of the HTML. The CNN learns suitable feature representations from the HTML document embeddings without extensive manual feature engineering.
Ra et al. [13] used word embeddings and Neural Bag-of-ngrams with DL approaches to detect phishing emails. Combining word embeddings and Neural Bag-of-ngrams makes it possible to extract the syntactic and semantic similarity of emails. DL techniques [14] extract abstract and optimal feature representations, and a fully connected (FC) layer with a nonlinear activation function performs the classification. Based on an enhanced recurrent CNN (RCNN) technique with multilevel vectors and an attention mechanism, Fang et al. [15] presented a novel phishing email recognition method called THEMIS, which models emails at the email header, email body, character, and word levels simultaneously. To evaluate the efficacy of THEMIS, the authors used an unbalanced dataset with realistic ratios of phishing and legitimate emails.
Bagui et al. [16] applied deep semantic analysis together with ML and DL approaches to capture the inherent features of email text and to classify emails as phishing or non-phishing. Zamir et al. [17] presented a feature-centric framework (FSEDM) based on existing and novel features of an email dataset, which are extracted after pre-processing. Various supervised learning approaches are then applied to the proposed features in conjunction with feature selection (FS) approaches, namely gain ratio, information gain, and Relief-F, to rank the most prominent features and classify the emails as spam or ham (not spam).

The Proposed Model
In this article, a new BBODL-PEDC technique has been developed for phishing email detection and classification, which effectively separates legitimate emails from phishing emails. The BBODL-PEDC model involves a series of subprocesses, namely pre-processing, TF-IDF vectorization, DBN based classification, and BBO based hyperparameter optimization.

Pre-Processing
Primarily, the data is cleaned by removing unwanted words and characters. Once the data is cleaned, the email data is pre-processed as follows [18]: (i) body text extraction; (ii) white space elimination via text parsing; and (iii) conversion of every character to lowercase and removal of non-alphanumeric characters.
The BBODL-PEDC model initially performs data pre-processing in three stages, namely email cleaning, tokenization, and stop word elimination. Firstly, the email cleaning procedure removes unwanted data and non-English characters. Next, tokenization is performed, where every email is broken into a set of words based on white space. The words obtained are called tokens. Then, stop words that do not carry important information, such as conjunctions, articles, and prepositions, are removed.
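The three pre-processing stages can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `preprocess_email` and the short `STOP_WORDS` set are assumptions made for the example, since the article does not name a specific stop word list.

```python
import re

# Illustrative stop word list (an assumption; the article does not specify one).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to",
              "in", "on", "for", "with", "is", "are"}

def preprocess_email(text):
    """Clean, tokenize, and remove stop words from one email body."""
    # Email cleaning: lowercase, then replace every run of characters that is
    # not an ASCII letter or digit (punctuation, non-English characters) by a space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # Tokenization: split the cleaned text on white space.
    tokens = cleaned.split()
    # Stop word elimination.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess_email("Dear user, VERIFY your account at the link!")
print(tokens)
```

A production pipeline would typically use a fuller stop word list and might keep some punctuation-derived signals (for example URLs), which this sketch discards.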

TF-IDF Model
The most commonly used measure in information retrieval is tf-idf. This term weighting method measures the probability-weighted amount of information in the given documents. In the conventional information model, idf is interpreted as 'the amount of information', given as the log of the inverse probability. Tf-idf itself is a measure that multiplies the two quantities tf and idf. The term frequency gives an estimate of the occurrence probability of a term when it is normalized by the total frequency in the document, or in the document collection, depending on the scope of the computation. In the basic formulation of the information model, a document is treated as an unordered set of terms. Let D = {d_1, …, d_N} be the set of documents and W = {w_1, …, w_M} be the set of distinct terms contained in D. In this analysis, D is the corpus of data extracted from the Twitter feed, whereas W denotes the query terms. The parameter N is the total number of documents and M is the number of terms. When fitting the model, the selection of terms w_i in W and the selection of documents d_j in D are also considered.
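As a concrete illustration of multiplying tf by idf, the sketch below computes tf-idf weights for a small tokenized corpus. It assumes the plain, unsmoothed variant (tf is the term count divided by the document length, idf is log(N / df)); the article does not state which variant it uses, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            # tf (normalized count) times idf (log of the inverse document probability)
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = [["verify", "account", "now"], ["meeting", "notes", "account"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, such as "account" here, receives a weight of zero, which is why tf-idf suppresses words that do not help separate the documents.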

DBN Based Email Classification
At this stage, the DBN model is used to classify emails as phishing or legitimate. A DBN is a probabilistic generative model that establishes the joint distribution between the input and the label information through the learning procedure [19]. A rational design of the DBN architecture, such as the number of restricted Boltzmann machine (RBM) layers, can effectively improve classifier performance. Choosing rational DBN operating parameters, including the amount of forward unsupervised learning, the number of hidden layers, and the learning rate, can likewise significantly improve the classification results. Considering both the training efficiency and the classification performance of the model, a DBN with a network structure of 124-250-250-2 is created.
By comparing classifier efficacy in a control experiment, the number of RBM layers in the DBN architecture is set to two. An RBM is a generative neural network (NN). A single RBM is a two-layer NN comprising a visible layer and a hidden layer. Neurons within the same layer are not connected to each other, and there is no self-feedback within a layer. The neurons of the visible and hidden layers are fully connected (FC) in both directions. The energy function between the visible and hidden layers is defined in terms of ω_ij, the weight connecting visible neuron i and hidden neuron j, and b_1 and b_2, the biases of the visible and hidden layer neurons, respectively. The joint probability distribution over the neurons is computed from this energy function. Let the input of the DBN architecture be X and the output of the hidden layer be H. The weights and biases connecting the hidden and output layer neurons are then updated using δ_k, the difference between the true class of the input and the actual output of the DBN, and ε, the learning rate of the DBN. The classification method of the DBN architecture comprises forward unsupervised "layer-by-layer initialization" learning and reverse supervised "fine-tuning" learning. The initial phase of training is called the pre-training method. Fig. 2 shows the framework of the DBN.
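For reference, the energy function and joint distribution of a standard binary RBM take the following form. This is a sketch under the assumption that the article uses the usual definitions: the symbols v_i and h_j (states of visible neuron i and hidden neuron j) and the partition function Z are introduced here for illustration, while ω_ij, b_1, and b_2 follow the text.

```latex
E(v, h) = -\sum_{i} b_{1,i}\, v_i \;-\; \sum_{j} b_{2,j}\, h_j \;-\; \sum_{i}\sum_{j} v_i\, \omega_{ij}\, h_j

P(v, h) = \frac{1}{Z}\, e^{-E(v, h)}, \qquad Z = \sum_{v}\sum_{h} e^{-E(v, h)}
```

Low-energy configurations of the visible and hidden neurons therefore receive high probability, which is what pre-training exploits when it fits each RBM to its input.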
The DBN framework performs forward training through layer-wise initialization learning. By stacking the RBM layers, the feature information of the input data is transferred and mapped layer by layer. The proposed technique places a Softmax classifier on top of the RBMs. The Softmax classifier receives the output of the top RBM as its input and outputs the result of the forward learning phase as a probability distribution over the classes. The Softmax classifier is based on a multinomial distribution. It can be regarded as a generalization of the logistic regression (LR) classifier to multiple classes and is used for multiclass classification problems. Its aim is to translate the output of the RBM, denoted by the vector y, into a probability distribution. The second phase of training is called the fine-tuning method. Through the initial pre-training phase, each RBM layer ensures that its weights give an optimal representation of the features of that layer, so that the mapping of the input data through the whole DBN approaches the optimum.
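The step of turning the top RBM's output vector y into a probability distribution can be sketched with the standard Softmax definition, p_k = exp(y_k) / Σ_j exp(y_j). This assumes the article uses the usual form; the score values and the class ordering (legitimate, phishing) below are hypothetical.

```python
import math

def softmax(y):
    """Map a vector of scores to a probability distribution."""
    shift = max(y)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output scores of the top layer for (legitimate, phishing).
probs = softmax([0.3, 2.1])
# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

With two output units, as in the 124-250-250-2 structure, Softmax reduces to the logistic function applied to the difference of the two scores.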

BBO Based Hyperparameter Optimization
At the final stage, the BBO algorithm [20] is employed for the optimal hyperparameter tuning of the DBN model. Biogeography is the study of the migration, mutation, speciation, and extinction of species. Biogeography is often viewed as a process that drives the number of species on the islands toward equilibrium. However, the equilibrium of a system can also be viewed as a minimum-energy configuration, so biogeography can be regarded as an optimization process. The BBO algorithm is an evolutionary technique developed for global optimization. It simulates the immigration and emigration of species among islands (or habitats) in search of better-suited islands. Each solution is called a "habitat" (or "island"), has a habitat suitability index (HSI), and is represented as an n-dimensional real vector. The initial population of habitat vectors is generated at random.
Figure 2: DBN structure
The habitat with the maximum HSI is regarded as the best solution, whereas the habitat with the minimum HSI is regarded as a poor solution. Low-HSI solutions accept many new features from high-HSI solutions, so a low-HSI solution has a comparatively high chance of developing into a high-HSI solution. In BBO, a habitat H is a vector of n suitability index variables (SIVs) that is initialized randomly and then undergoes migration and mutation to reach the optimum solution. New candidate solutions are generated from all the habitats in the population using the migration and mutation operators. The migration operator modifies an existing habitat and thereby alters the current solution. Migration is a probabilistic operator that adjusts habitat X_i. The probability that X_i is modified is proportional to its immigration rate λ_i, and the probability that the source of the modification is X_j is proportional to its emigration rate μ_j.
Mutation is also a probabilistic operator; it randomly modifies a habitat's SIVs based on the habitat's a priori probability of existence. Very high HSI solutions and very low HSI solutions are equally improbable, whereas medium HSI solutions are comparatively probable. The mutation rate m is formulated with m_max as an adjustable parameter. Moreover, the mutation operator serves to improve the diversity of the population. Fig. 3 depicts the process flow of the BBO technique.
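The migration and mutation operators described above can be sketched as a compact optimization loop. This is a simplified illustration, not the authors' implementation: it assumes linear immigration and emigration rates based on HSI rank, keeps the best habitat unchanged as an elite, and uses a constant mutation probability `m_max` in place of the probability-dependent mutation rate m. The function name `bbo`, its parameters, and the toy HSI function are all hypothetical; in the BBODL-PEDC setting a habitat would encode DBN hyperparameters and the HSI would be a validation score.

```python
import random

def bbo(hsi, dim, bounds, pop_size=20, generations=50, m_max=0.05, seed=0):
    """Maximize hsi over real vectors with a simplified BBO loop."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Initial population of habitats, generated at random.
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Rank habitats from best (highest HSI) to worst.
        pop.sort(key=hsi, reverse=True)
        # Linear rates: good habitats emigrate more, poor habitats immigrate more.
        mu = [(pop_size - rank) / pop_size for rank in range(pop_size)]
        lam = [1.0 - m for m in mu]
        new_pop = [h[:] for h in pop]
        for i in range(1, pop_size):  # index 0 (current best) is kept as an elite
            for d in range(dim):
                if rng.random() < lam[i]:
                    # Migration: copy this SIV from a source habitat chosen
                    # with probability proportional to its emigration rate.
                    j = rng.choices(range(pop_size), weights=mu)[0]
                    new_pop[i][d] = pop[j][d]
                if rng.random() < m_max:
                    # Mutation: resample the SIV uniformly within the bounds.
                    new_pop[i][d] = rng.uniform(lo, hi)
        pop = new_pop
    return max(pop, key=hsi)

# Toy HSI (an assumption): higher is better, with the optimum at the origin.
def sphere(x):
    return -sum(v * v for v in x)

best = bbo(sphere, dim=3, bounds=(-5.0, 5.0))
print(best, sphere(best))
```

Because the elite habitat is never modified, the best HSI in the population cannot decrease from one generation to the next.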

Experimental Validation
The experimental results of the proposed model are validated using the benchmark CLAIR dataset [21], which contains 3685 phishing and 4894 legitimate emails. For instance, under run-1, the BBODL-PEDC model attained prec_n, reca_l, accu_y, and F_score of 99.11%, 99.32%, 99.32%, and 99.21% respectively. In addition, on run-2, the BBODL-PEDC model obtained prec_n, reca_l, accu_y, and F_score of 99.24%, 99.32%, 99.38%, and 99.28% respectively. On run-3, the BBODL-PEDC model achieved prec_n, reca_l, accu_y, and F_score of 99.21%, 99.00%, 99.23%, and 99.10% respectively. On run-4, the BBODL-PEDC model reached prec_n, reca_l, accu_y, and F_score of 99.21%, 98.91%, 99.20%, and 99.06% respectively. On run-5, the BBODL-PEDC model exhibited prec_n, reca_l, accu_y, and F_score of 99.10%, 99.13%, 99.24%, and 99.12% respectively. Lastly, on run-10, the BBODL-PEDC model accomplished prec_n, reca_l, accu_y, and F_score of 99.02%, 99.00%, 99.15%, and 99.01% respectively.
Finally, an extensive comparison study of the BBODL-PEDC model with recent approaches is made in Tab. 2. Fig. 7 shows a comparative prec_n examination of the BBODL-PEDC model with existing ones. The figure shows that the RCNN and machine learning accelerator-natural language processing (MLA-NLP) models obtained lower performance, with prec_n of 96.53% and 95% respectively. In addition, the DL and hierarchical long short term memory (H-LSTM) models attained moderately reduced prec_n values of 97% and 97.45% respectively. The graph convolutional network (GCN) model resulted in a competitive prec_n of 98.50%. However, the BBODL-PEDC model outperformed the other methods with prec_n of 99.15%.
In terms of F_score, the DL and MLA-NLP models showed poor results, with F_score of 96.00% and 95.36% respectively. Additionally, the RCNN and H-LSTM models accomplished moderately reduced F_score values of 97.12% and 96.71% respectively. Besides, the GCN model reached a near optimal F_score of 98.55%. However, the BBODL-PEDC model outperformed the other methods with F_score of 99.12%.
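The per-run figures quoted above can be cross-checked with a short script. It assumes that F_score is the harmonic mean of prec_n and reca_l (the usual F1 definition), which the article does not state explicitly; the function name `f_score` and the `runs` table are introduced here for illustration, with values copied from the text.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

# (prec_n, reca_l, reported F_score) for the runs quoted in the text.
runs = {
    "run-1": (99.11, 99.32, 99.21),
    "run-2": (99.24, 99.32, 99.28),
    "run-3": (99.21, 99.00, 99.10),
    "run-4": (99.21, 98.91, 99.06),
    "run-5": (99.10, 99.13, 99.12),
    "run-10": (99.02, 99.00, 99.01),
}

for name, (p, r, reported) in runs.items():
    print(name, round(f_score(p, r), 3), reported)
```

Under that assumption, the recomputed values agree with the reported F_score values to within rounding of the published precision and recall.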
After examining the above-mentioned tables and figures, it is evident that the BBODL-PEDC model shows effective results compared with the other methods.
Figure: Acc_y analysis of BBODL-PEDC technique with recent algorithms

Conclusion
In this article, a new BBODL-PEDC technique has been developed for phishing email detection and classification, which effectively separates legitimate emails from phishing emails. The BBODL-PEDC model involves a series of subprocesses, namely pre-processing, TF-IDF vectorization, DBN based classification, and BBO based hyperparameter optimization. The efficacy of the DBN model is boosted by the BBO based hyperparameter tuning process. The performance of the BBODL-PEDC model is validated on a benchmark dataset, and the results are assessed under several measures. The extensive comparative studies report the superior outcomes of the BBODL-PEDC model over recent approaches. In the future, advanced DL models with hybrid metaheuristic optimization algorithms can be designed for phishing email detection.