Optimal Deep Belief Network Enabled Malware Detection and Classification Model

Cybercrime has increased considerably in recent times, with attackers devising new methods of stealing, altering, and destroying data in daily life. The Portable Document Format (PDF) has long been a popular vehicle for spreading malware. Recent advances in machine learning (ML) and deep learning (DL) models are utilized to detect and classify malware. With this motivation, this study focuses on the design of a mayfly optimization with deep belief network for PDF malware detection and classification (MFODBN-MDC) technique. The major intention of the MFODBN-MDC technique is to identify and classify malware present in PDFs. The proposed MFODBN-MDC method derives a new MFO algorithm for the optimal selection of feature subsets. In addition, the Adamax optimizer with the DBN model is used for PDF malware detection and classification. The design of the MFO algorithm for feature selection and the Adamax-based hyperparameter tuning for PDF malware detection and classification demonstrates the novelty of the work. To demonstrate the improved outcomes of the MFODBN-MDC model, a wide range of simulations is executed, and the results are assessed in various aspects. The comparison study highlights the enhanced outcomes of the MFODBN-MDC model over existing techniques, with maximum precision, recall, and F1 score of 97.42%, 97.33%, and 97.33%, respectively.


Introduction
Portable Document Format (PDF) is a very popular and trusted file format, and Adobe Reader is the most commonly used program for opening this type of file. This encourages attackers to search for vulnerabilities and for new ways of building exploits that execute arbitrary code when a file is opened with this software. PDF documents are therefore commonly used by cybercriminals to launch attacks [1]. A PDF document on an enticing topic is sent to the target, and once the document is opened, a specific vulnerability in the software configuration or implementation is exploited to launch the next stage of the attack [2]. Examples include direct execution of a native executable (when the code is embedded in the PDF document itself), injection of code into a running process, or downloading a binary from the internet and later executing it [3,4].
Malware detection methods are categorized into signature-based and behavior-based methods [5]. A signature-based malware detector works effectively only with known malware that has previously been detected by anti-malware vendors. To address this limitation, machine learning (ML) techniques and heuristic analysis are utilized, which provide high recognition performance [6]. According to the available data, conventional methods in the malware detection field have depended on signature analysis [7], which is unsuitable for detecting unknown computer viruses. To sustain an appropriate security level, users are forced to upgrade antivirus databases constantly and in a timely manner.
ML techniques for malware classification have utilized a wide range of information to learn discriminative functions that can distinguish benign from malicious software. A few common data sources [8] have been studied, including entropy measures on the binary, dynamic system call traces, disassembled files, binary files, control flow graphs, and dynamic instruction traces. In the last few years, several attempts have been made to develop classifiers from malware features. Data mining and ML methods are utilized for developing smart malware classification and detection techniques [9]. Deep neural networks (DNNs) have attained considerable success in various applications, particularly in computer vision. Even though deep learning (DL) models are effective, they have some limitations in real-time detection tasks, particularly in the security domain [10]. With the influx of zero-day and unlabeled malware, the recognition accuracy of DL is also lower. Deep models incur a high computational overhead and are very intricate. They also have a considerable number of hyperparameters, and improved performance can be accomplished by tuning them properly.
This study presents a mayfly optimization with deep belief network for PDF malware detection and classification (MFODBN-MDC) model. The proposed MFODBN-MDC model first undergoes two stages of pre-processing, namely categorical encoding and null value removal. Moreover, the MFODBN-MDC technique derives an MFO algorithm for optimal selection of feature subsets. Furthermore, the DBN model is used for PDF malware detection and classification. Finally, the hyperparameter tuning of the DBN model takes place using the Adamax optimizer. To exhibit the better performance of the MFODBN-MDC model, a wide range of simulations were executed and the results were evaluated under numerous aspects.
The rest of the paper is organized as follows. Section 2 offers a detailed literature review and Section 3 discusses the proposed model. Then, Section 4 provides experimental validation and Section 5 draws the conclusions.

Literature Review
Corum et al. [11] introduced a learning-based model for identifying PDF malware with image visualization and image processing methods. The PDF file is initially transformed into a grayscale image through an image visualization technique. Next, image features representing the visual characteristics of malicious and benign PDF files are extracted. Lastly, a learning algorithm is employed to build a classifier that categorizes a PDF file as malicious or benign. Sethi et al. [12] proposed an ML-based malware analysis method for accurate and efficient malware classification and detection. Furthermore, they proposed feature extraction and selection modules that extract features from the report and select the essential features to ensure higher accuracy at a minimal computational cost. Distinct ML methods are used for fine-grained classification and accurate detection.
The researchers in [13] proposed a trusted architecture to identify unknown malware in Linux virtual machine (VM) cloud environments. The presented method obtains volatile memory dumps from the examined VM by querying the hypervisor in a reliable way, thereby overcoming the malware's ability to evade detection and the security mechanism. ML algorithms are used to leverage informative traces (171 features) from distinct portions of the VM volatile memory. Li et al. [14] developed an evasion mechanism based on a feature-vector generative adversarial network (fvGAN) for attacking a learning-enabled malware classifier. The method on which it is based is commonly employed in real-time fake image generation. Damaševičius et al. [15] proposed an ensemble classifier-based method for detecting malware. Initially, classification is performed by a stacked ensemble of dense (fully connected, FC) networks and convolutional neural networks (CNN); the result is then passed to a meta-learner. For the meta-learner, fourteen classifiers are explored and compared.
In [16], a new malware detection scheme based on a two-phase artificial neural network (ANN) is presented. The presented method is tested on the 'Malimg' dataset, which comprises visual depictions of malware families. Some significant image features are extracted, and the ANN is trained on these features. Next, the ANN is utilized for detecting and classifying other data samples. Shhadat et al. [17] examined ML algorithms utilized in unknown malware detection. The study proposes a feature set selected using RF to reduce the number of features. Various ML methods are applied to a standard dataset in this experiment. Roy et al. [18] aimed to develop a DL-based detector, DeepRan, for early classification and detection of ransomware. The presented method employs an attention-based bidirectional long short-term memory (Bi-LSTM) network with a fully connected (FC) layer to model the normal behavior of hosts in an operational enterprise system and to detect anomalous activity from the massive amount of ambient host logging data collected from bare metal servers. The researchers in [19] presented Deep-Hook, a trusted architecture for detecting unknown malware in Linux-based cloud environments. The memory dump is converted into a visual image, which is then analyzed by a CNN-based classifier.

The Proposed Model
In this study, a new MFODBN-MDC model has been developed for the identification and classification of PDF malware. The proposed MFODBN-MDC model involves four stages of operations, namely pre-processing, MFO-based feature subset selection, DBN classification, and Adamax hyperparameter optimization. Fig. 1 illustrates the overall process of the MFODBN-MDC technique.

Data Pre-processing
At the initial stage, the input data is pre-processed in two steps, namely categorical encoding and null value removal. First, the categorical values are encoded into numerical values. Second, the null values that exist in the dataset are removed.
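The two pre-processing steps can be illustrated with a minimal sketch in plain Python. It assumes label encoding (one integer per distinct category) and treats `None` as the null value; the paper does not state which encoding scheme is used, and the field names in the toy records are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def encode_categoricals(rows: List[Row]) -> Tuple[List[Row], Dict[str, Dict[str, int]]]:
    """Replace each string value with an integer code, one code table per column."""
    code_tables: Dict[str, Dict[str, int]] = {}
    encoded: List[Row] = []
    for row in rows:
        new_row: Row = {}
        for column, value in row.items():
            if isinstance(value, str):
                table = code_tables.setdefault(column, {})
                # A value seen for the first time gets the next unused integer.
                new_row[column] = table.setdefault(value, len(table))
            else:
                new_row[column] = value
        encoded.append(new_row)
    return encoded, code_tables


def drop_null_rows(rows: List[Row]) -> List[Row]:
    """Keep only the rows in which no field is None (assumed null marker)."""
    return [row for row in rows if all(value is not None for value in row.values())]


# Hypothetical toy records; the field names are not taken from the paper's datasets.
records = [
    {"pdf_size": 120, "header": "%PDF-1.4", "label": "benign"},
    {"pdf_size": None, "header": "%PDF-1.7", "label": "malicious"},
    {"pdf_size": 300, "header": "%PDF-1.4", "label": "malicious"},
]
encoded, code_tables = encode_categoricals(records)  # step 1: categorical encoding
clean = drop_null_rows(encoded)                      # step 2: null value removal
print(clean)
print(code_tables)
```

The order follows the text (encoding first, then null removal); swapping the two steps would give the same surviving rows but could assign different integer codes.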

Design of MFO Based Feature Selection Approach
Following data pre-processing, the MFO algorithm is utilized for the effective selection of features [19]. The MFO algorithm was proposed by imitating the group behavior of mayflies (MF), especially their mating behavior. Initially, the mayflies are divided into male and female populations. Each MF is randomly placed in a $d$-dimensional space and is regarded as a candidate solution represented by the position vector $x = (x_1, x_2, \cdots, x_d)$. Next, the velocity vector, which represents the change in position, is defined by

$$v = (v_1, v_2, \cdots, v_d) \tag{1}$$

Movement of male MF: $x_i^t$ denotes the position of the $i$-th male MF at time $t$, and $v_{i,\mathrm{male}}^{t+1}$ denotes the velocity that is added to $x_i^t$ to change the position of the $i$-th individual. The position of the male MF at time $t+1$ is formulated as follows:

$$x_i^{t+1} = x_i^t + v_{i,\mathrm{male}}^{t+1} \tag{2}$$

The velocity of the $i$-th MF in the $j$-th dimension is as follows:

$$v_{ij,\mathrm{male}}^{t+1} = v_{ij,\mathrm{male}}^{t} + a_1 e^{-\beta r_p^2}\left(\mathrm{pbest}_{ij} - x_{ij}^t\right) + a_2 e^{-\beta r_g^2}\left(\mathrm{gbest}_j - x_{ij}^t\right) \tag{3}$$

where $x_{ij}^t$ and $v_{ij,\mathrm{male}}^{t}$ denote the position and velocity of the $i$-th MF in the $j$-th dimension, respectively. $a_i$ $(i = 1, 2)$ are positive attraction constants that weight the cognitive and social components. $\mathrm{pbest}_{ij}$ and $\mathrm{gbest}_j$ denote the local and global best positions, respectively. $\beta$ is a fixed visibility coefficient that limits the visibility of an individual to other individuals. $r_p$ and $r_g$ denote the Cartesian distances from the $i$-th MF to the local and global best solutions, respectively. For a minimization problem, the local best position $\mathrm{pbest}_i$ is updated by the following equation, and $\mathrm{gbest}$ is the best of the local best positions in the population:

$$\mathrm{pbest}_i = \begin{cases} x_i^{t+1}, & \text{if } f\left(x_i^{t+1}\right) < f\left(\mathrm{pbest}_i\right) \\ \mathrm{pbest}_i, & \text{otherwise} \end{cases} \tag{4}$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is the objective function. The best MF in the population continually performs an up-and-down nuptial dance to keep the search effective; that is, the velocity of the best MF is always changed as follows:

$$v_{ij,\mathrm{male}}^{t+1} = v_{ij,\mathrm{male}}^{t} + d \cdot r \tag{5}$$

in which $d$ is the nuptial dance coefficient, $r$ is a random value within $[-1, 1]$, and $v_{ij,\mathrm{male}}^{t}$ is the velocity of the $i$-th male MF in the $j$-th dimension.
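The male movement described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard mayfly-algorithm form of the velocity update (attraction toward `pbest` and `gbest`, damped by `exp(-beta * distance**2)`); the default constants `a1`, `a2`, `beta`, and `d` are illustrative values, not settings reported in this paper.

```python
import math
import random
from typing import List, Sequence


def male_velocity(v: Sequence[float], x: Sequence[float],
                  pbest: Sequence[float], gbest: Sequence[float],
                  a1: float = 1.0, a2: float = 1.5, beta: float = 2.0) -> List[float]:
    """New velocity of one male mayfly (illustrative constants, assumed values)."""
    r_p = math.dist(x, pbest)  # Cartesian distance to the personal best
    r_g = math.dist(x, gbest)  # Cartesian distance to the global best
    pull_p = a1 * math.exp(-beta * r_p ** 2)
    pull_g = a2 * math.exp(-beta * r_g ** 2)
    return [
        v_j + pull_p * (pb_j - x_j) + pull_g * (gb_j - x_j)
        for v_j, x_j, pb_j, gb_j in zip(v, x, pbest, gbest)
    ]


def nuptial_dance(v: Sequence[float], d: float = 0.1, rng=random) -> List[float]:
    """Velocity of the best male: previous velocity plus d times a random value in [-1, 1]."""
    return [v_j + d * rng.uniform(-1.0, 1.0) for v_j in v]


def move(x: Sequence[float], v: Sequence[float]) -> List[float]:
    """Position update: new position = old position + new velocity."""
    return [x_j + v_j for x_j, v_j in zip(x, v)]


# Toy usage: a mayfly at the origin is pulled toward bests located at (1, 0).
velocity = male_velocity(v=[0.0, 0.0], x=[0.0, 0.0], pbest=[1.0, 0.0], gbest=[1.0, 0.0])
print(move([0.0, 0.0], velocity))
```

Because the attraction decays exponentially with squared distance, a mayfly far from both best positions barely changes its velocity, which is why the random nuptial dance of the best male matters for exploration.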
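Movement of female MF: Unlike male MFs, which gather in swarms, female MFs fly toward the males to breed. The current position and the velocity of the $i$-th female MF at time $t$ are denoted by $y_i^t$ and $v_{i,\mathrm{female}}^{t+1}$, respectively. The position of the female MF at time $t+1$ is given by:

$$y_i^{t+1} = y_i^t + v_{i,\mathrm{female}}^{t+1} \tag{6}$$

In the MFO algorithm, the attraction process is modeled deterministically. For a minimization problem, the velocity of the $i$-th female MF in the $j$-th dimension is estimated as follows:

$$v_{ij,\mathrm{female}}^{t+1} = \begin{cases} v_{ij,\mathrm{female}}^{t} + a_2 e^{-\beta r_{mf}^2}\left(x_{ij}^t - y_{ij}^t\right), & \text{if } f(y_i) > f(x_i) \\ v_{ij,\mathrm{female}}^{t} + fl \cdot r, & \text{if } f(y_i) \le f(x_i) \end{cases} \tag{7}$$

where $y_{ij}^t$ and $v_{ij,\mathrm{female}}^{t}$ denote the position and velocity of the $i$-th female MF in the $j$-th dimension, $x_{ij}^t$ is the position of the $i$-th male MF, $r_{mf}$ is the Cartesian distance between the $i$-th male MF and the $i$-th female MF, $fl$ is the random walk coefficient, and $r$ is a random value within $[-1, 1]$.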
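A minimal Python sketch of the female update follows, again assuming the standard mayfly-algorithm form: a female is attracted to her paired male only when he has the better (lower) objective value, and otherwise performs a random walk. The defaults for `a2`, `beta`, and `fl` are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence


def female_velocity(v: Sequence[float], y: Sequence[float], x_male: Sequence[float],
                    f_female: float, f_male: float,
                    a2: float = 1.5, beta: float = 2.0, fl: float = 1.0,
                    rng=random) -> List[float]:
    """New velocity of one female mayfly for a minimization problem."""
    if f_female > f_male:
        # The paired male is better: move toward him, damped by distance.
        r_mf = math.dist(y, x_male)
        pull = a2 * math.exp(-beta * r_mf ** 2)
        return [v_j + pull * (x_j - y_j) for v_j, x_j, y_j in zip(v, x_male, y)]
    # The female is at least as good: take a random walk scaled by fl.
    return [v_j + fl * rng.uniform(-1.0, 1.0) for v_j in v]


# Toy usage: the male at (1,) has the lower objective value, so the female at (0,) is attracted.
print(female_velocity(v=[0.0], y=[0.0], x_male=[1.0], f_female=2.0, f_male=1.0))
```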
Mating of MF: Two parents are selected, one from the male population and one from the female population. The mating rule pairs the best male with the best female, creating two offspring based on the following equations:

$$\mathrm{offspring}_1 = L \cdot \mathrm{male} + (1 - L) \cdot \mathrm{female} \tag{8}$$

$$\mathrm{offspring}_2 = L \cdot \mathrm{female} + (1 - L) \cdot \mathrm{male} \tag{9}$$

Here, $\mathrm{male}$ and $\mathrm{female}$ denote the male and female parents from the preceding generation, and $L \in (0, 1)$ is a random number. The initial velocity of the offspring in the present generation is set to 0.

The MFO algorithm derives a fitness function from two quantities for the effective selection of features, namely the classification accuracy and the number of chosen features. It is defined as follows:

$$\mathrm{Fitness} = \omega_1 \times \mathrm{acc}(\mathrm{classifier}) + \omega_2 \times \left(1 - \frac{s}{p}\right) \tag{10}$$

where $p$ signifies the total number of features and $s$ denotes the number of chosen features. Here, the values of $\omega_1$ and $\omega_2$ are 1 and 0.001, respectively [17]. $\mathrm{acc}(\mathrm{classifier})$ represents the overall classification accuracy attained by the DBN model, which is obtained using Eq. (11):

$$\mathrm{acc}(\mathrm{classifier}) = \frac{n_c}{n_c + n_i} \times 100\% \tag{11}$$

where $n_i$ and $n_c$ represent the numbers of wrongly and properly classified samples, respectively.
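The mating rule and the fitness function can be sketched as follows. The sketch assumes the usual arithmetic-crossover form for the two offspring and a fitness of the form weight × accuracy + weight × (1 − selected/total), with the weights 1 and 0.001 quoted in the text; accuracy is expressed here as a fraction in [0, 1], and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def mate(male: Sequence[float], female: Sequence[float],
         L: float) -> Tuple[List[float], List[float]]:
    """Create two offspring as complementary blends of the parents, with L in (0, 1)."""
    child_1 = [L * m + (1.0 - L) * f for m, f in zip(male, female)]
    child_2 = [L * f + (1.0 - L) * m for m, f in zip(male, female)]
    return child_1, child_2


def accuracy(n_correct: int, n_wrong: int) -> float:
    """Fraction of properly classified samples."""
    return n_correct / (n_correct + n_wrong)


def fitness(acc: float, n_selected: int, n_total: int,
            w1: float = 1.0, w2: float = 0.001) -> float:
    """Reward high accuracy and, with a much smaller weight, small feature subsets."""
    return w1 * acc + w2 * (1.0 - n_selected / n_total)


# Toy usage with made-up counts: 90 of 100 samples correct, 5 of 20 features kept.
print(mate([4.0, 2.0], [0.0, 6.0], L=0.25))
print(fitness(accuracy(90, 10), n_selected=5, n_total=20))
```

With these weights the feature-count term can change the fitness by at most 0.001, so it acts mainly as a tie-breaker between subsets of nearly equal accuracy.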

Process Involved in Optimal DBN Based Classification
For the identification and classification of PDF malware, the DBN model [20] is applied in this work. A DBN is a multi-layered probabilistic model [20] with many parameters to be learned. Each layer is a simple undirected graph called a restricted Boltzmann machine (RBM). An RBM has two kinds of layers, a visible layer and a hidden layer. The hidden layer is the top layer, and the visible layer is the bottom layer. Fig. 2 illustrates the framework of the DBN. An RBM encodes the joint probability distribution through an energy function, where $v$ denotes the visible data, $h$ the hidden data, $w$ the weights, and $\theta = \{w, b, c\}$ the model parameters, with $b$ and $c$ the visible and hidden biases. It can be expressed as follows:

$$E(v, h; \theta) = -\sum_{i} b_i v_i - \sum_{j} c_j h_j - \sum_{i} \sum_{j} v_i w_{ij} h_j$$
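To make the role of the energy function concrete, the sketch below evaluates the standard RBM energy for binary units and turns it into a joint probability by normalizing `exp(-E)` over all configurations. The bias names `b` (visible) and `c` (hidden) and the toy parameter values are assumptions for illustration; exhaustive enumeration is only feasible for very small models.

```python
import itertools
import math
from typing import Dict, Sequence, Tuple

State = Tuple[Tuple[int, ...], Tuple[int, ...]]


def energy(v: Sequence[int], h: Sequence[int], W: Sequence[Sequence[float]],
           b: Sequence[float], c: Sequence[float]) -> float:
    """Standard RBM energy: negative sum of bias terms and pairwise interaction terms."""
    visible_term = sum(b_i * v_i for b_i, v_i in zip(b, v))
    hidden_term = sum(c_j * h_j for c_j, h_j in zip(c, h))
    interaction = sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return -(visible_term + hidden_term + interaction)


def joint_probabilities(W: Sequence[Sequence[float]], b: Sequence[float],
                        c: Sequence[float]) -> Dict[State, float]:
    """p(v, h) = exp(-E(v, h)) / Z, enumerated over every binary configuration."""
    states = [
        (v, h)
        for v in itertools.product((0, 1), repeat=len(b))
        for h in itertools.product((0, 1), repeat=len(c))
    ]
    weights = [math.exp(-energy(v, h, W, b, c)) for v, h in states]
    Z = sum(weights)  # partition function
    return {state: w / Z for state, w in zip(states, weights)}


# Toy RBM with 2 visible and 2 hidden units (arbitrary illustrative parameters).
W = [[0.5, -1.0], [1.0, 0.0]]
b = [0.1, -0.2]
c = [0.0, 0.3]
probs = joint_probabilities(W, b, c)
print(max(probs, key=probs.get))  # the lowest-energy, most probable configuration
```

Lower energy means higher probability, which is why training aims to lower the energy of configurations that match the observed data.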
From this energy function, a rule for updating the unit states can be derived; each update leads to a lower-energy state, and the network eventually settles into equilibrium:

$$p(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_{i} w_{ij} v_i\Big)$$

$$p(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_{j} w_{ij} h_j\Big)$$

where $\sigma(x) = 1 / (1 + \exp(-x))$ is the sigmoid function, and $b_i$ and $c_j$ are the biases of the visible and hidden units. The visible layer is presented with the input data to train the RBM. Learning adapts the parameters $\theta$ so that the probability distribution of the model becomes maximally similar to the true distribution, which amounts to maximizing the log-probability of the observed data. Contrastive divergence (CD) samples a value for each hidden unit given the present input, which yields a complete sample $(v_{data}, h_{data})$. A sample from the model is obtained as $(v_{model}, h_{model})$. The weights are updated as follows:

$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right)$$

where $\epsilon$ is the learning rate.

To effectively adjust the hyperparameter values of the DBN model, the Adamax optimizer is utilized [21]. Adamax is a variant of Adam based on the infinity norm. In Adam, the update rule for individual weights scales their gradients inversely proportional to a (scaled) $L^2$ norm of the current and past gradients. The $L^2$ norm-based update rule can be generalized to an $L^p$ norm-based update rule. Such variants become numerically unstable for large $p$. However, in the special case [22] where $p \to \infty$, a simple and stable algorithm emerges. With $g_t$ denoting the gradient at step $t$, the biased first moment estimate is updated as:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

The exponentially weighted infinity norm is updated as:

$$u_t = \max\left(\beta_2 u_{t-1}, |g_t|\right)$$

The parameters are updated as:

$$\theta_t = \theta_{t-1} - \frac{\alpha}{1 - \beta_1^t} \cdot \frac{m_t}{u_t}$$

The default setting for the tested ML problems is $\alpha = 0.002$, $\beta_1 = 0.9$, $\beta_2 = 0.999$.
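The three Adamax steps can be sketched in plain Python as below. The function minimizes an arbitrary differentiable objective given its gradient; the small constant `eps` in the denominator is an added safeguard against division by zero and is not part of the rule as described in the text, and the quadratic test objective is purely illustrative.

```python
from typing import Callable, List, Sequence


def adamax_minimize(grad: Callable[[Sequence[float]], Sequence[float]],
                    theta: Sequence[float], steps: int = 200,
                    alpha: float = 0.002, beta1: float = 0.9, beta2: float = 0.999,
                    eps: float = 1e-8) -> List[float]:
    """Run `steps` Adamax updates starting from `theta` and return the final parameters."""
    theta = list(theta)
    m = [0.0] * len(theta)  # biased first moment estimate
    u = [0.0] * len(theta)  # exponentially weighted infinity norm
    for t in range(1, steps + 1):
        g = grad(theta)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_i
            u[i] = max(beta2 * u[i], abs(g_i))
            # Bias-corrected step; eps is an assumption added for numerical safety.
            theta[i] -= (alpha / (1.0 - beta1 ** t)) * m[i] / (u[i] + eps)
    return theta


def quadratic_grad(theta: Sequence[float]) -> List[float]:
    """Gradient of the toy objective (theta0 - 3)^2 + (theta1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


# A larger step size than the default is used so the toy problem converges quickly.
print(adamax_minimize(quadratic_grad, [0.0, 0.0], steps=500, alpha=0.05))
```

Because the step is the ratio of the first moment to the running maximum of the gradient magnitude, each individual update is bounded in size by roughly `alpha`, which is one reason the default learning rate can be larger than Adam's.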

Experimental Validation
Tab. 1 provides detailed classification outcomes of the DBN and MFODBN-MDC models on two datasets. Fig. 8 reports the result analysis of the MFODBN-MDC and DBN models on the CIC Evasive-PDFMal2022 dataset. The results indicate that the DBN model obtained accuracy, precision, recall, F1 score, and area under the curve (AUC) of 89.38%, 89.85%, 90.18%, 89.38%, and 91.31%, respectively. However, the MFODBN-MDC model offered enhanced performance, with accuracy, precision, recall, F1 score, and AUC of 95.58%, 95.55%, 95.51%, 95.53%, and 98.91%, respectively. Fig. 9 shows the result analysis of the MFODBN-MDC and DBN models on the Contagio dataset. The results show that the DBN model attained accuracy, precision, recall, F1 score, and AUC of 93.93%, 94.17%, 93.93%, 93.92%, and 92.40%, respectively, whereas the MFODBN-MDC technique delivered enhanced performance, with accuracy, precision, recall, F1 score, and AUC of 97.33%, 97.42%, 97.33%, 97.33%, and 99.30%, respectively.
Fig. 11 provides a comparative precision, recall, and F1 score examination of the MFODBN-MDC model with existing models. The figure shows that the DT, RF, and RR methods performed poorly, with the lowest values of precision, recall, and F1 score. Next, the AdaBoost and SGDC approaches reported somewhat improved values of precision, recall, and F1 score. Along the same lines, the LR method attained considerable precision, recall, and F1 score values of 95.73%, 96.38%, and 96.39%, respectively. However, the MFODBN-MDC method achieved the maximum precision, recall, and F1 score of 97.42%, 97.33%, and 97.33%, respectively. From the above tables and figures, it is clear that the MFODBN-MDC model achieved the best PDF malware detection and classification outcomes among the compared methods.
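For reference, the reported measures are conventionally derived from confusion-matrix counts as in the sketch below. It covers the binary case only; whether the paper averages per-class values is not stated, and the counts used here are made up for illustration and are not results from the study.

```python
def precision(tp: int, fp: int) -> float:
    """Share of samples flagged as malware that really are malware."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual malware samples that were flagged."""
    return tp / (tp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2.0 * p * r / (p + r)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for illustration only.
tp, tn, fp, fn = 90, 70, 10, 30
print(precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn), accuracy(tp, tn, fp, fn))
```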

Conclusion
In this study, an MFODBN-MDC technique was established for the identification and classification of PDF malware. The proposed MFODBN-MDC technique comprises four stages of operations, namely pre-processing, MFO-based feature subset selection, DBN classification, and Adamax hyperparameter optimization. To exhibit the better performance of the MFODBN-MDC model, a wide range of simulations were executed, and the outcomes were evaluated under various aspects. The extensive comparative analysis reported the enhanced outcomes of the MFODBN-MDC model over recent approaches. Therefore, the MFODBN-MDC model can be utilized as a proficient tool for PDF malware detection and classification. In the future, the classification results of the MFODBN-MDC model can be improved by using outlier detection and feature reduction approaches.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.