<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">18260</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2021.018260</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Toward Robust Classifiers for PDF Malware Detection</article-title>
<alt-title alt-title-type="left-running-head">Toward Robust Classifiers for PDF Malware Detection</alt-title>
<alt-title alt-title-type="right-running-head">Toward Robust Classifiers for PDF Malware Detection</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western">
<surname>Albahar</surname>
<given-names>Marwan</given-names>
</name>
<xref ref-type="corresp" rid="cor1">&#x002A;</xref>
</contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western">
<surname>Thanoon</surname>
<given-names>Mohammed</given-names>
</name>
</contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western">
<surname>Alzilai</surname>
<given-names>Monaj</given-names>
</name>
</contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western">
<surname>Alrehily</surname>
<given-names>Alaa</given-names>
</name>
</contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western">
<surname>Alfaar</surname>
<given-names>Munirah</given-names>
</name>
</contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western">
<surname>Algamdi</surname>
<given-names>Maimoona</given-names>
</name>
</contrib>
<contrib id="author-7" contrib-type="author">
<name name-style="western">
<surname>Alassaf</surname>
<given-names>Norah</given-names>
</name>
</contrib>
<aff><institution>College of Computers in Al-Leith, Umm Al Qura University</institution>, <addr-line>Makkah</addr-line>, <country>Saudi Arabia</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1">&#x002A;Corresponding Author: Marwan Albahar. Email: <email>mabahar@uqu.edu.sa</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2021-07-13"><day>13</day><month>07</month><year>2021</year>
</pub-date>
<volume>69</volume>
<issue>2</issue>
<fpage>2181</fpage>
<lpage>2202</lpage>
<history>
<date date-type="received"><day>03</day><month>3</month><year>2021</year>
</date>
<date date-type="accepted"><day>19</day><month>4</month><year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2021 Albahar et al.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Albahar et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_18260.pdf"></self-uri>
<abstract>
<p>Malicious Portable Document Format (PDF) files represent one of the largest threats in the computer security space. Significant research has been done using handwritten signatures and machine learning based on detection <italic>via</italic> manual feature extraction. These approaches are time consuming, require substantial prior knowledge, and the list of features must be updated with each newly discovered vulnerability individually. In this study, we propose two models for PDF malware detection. The first model is a convolutional neural network (CNN) integrated into a standard deviation based regularization model to detect malicious PDF documents. The second model is a support vector machine (SVM) based ensemble model with three different kernels. The two models were trained and tested on two different datasets. The experimental results show that the accuracy of both models is approximately 100%, and the robustness against evasive samples is excellent. Further, the robustness of the models was evaluated with malicious PDF documents generated using Mimicus. Both models can distinguish the different vulnerabilities exploited in malicious files and achieve excellent performance in terms of generalization ability, accuracy, and robustness.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Malicious PDF classification</kwd>
<kwd>robustness</kwd>
<kwd>guiding principles</kwd>
<kwd>convolutional neural network</kwd>
<kwd>new regularization</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Malware remains a hot topic in the field of computer security. It is employed by criminals, industries, and even government actors for espionage, theft, and other malicious endeavors. With several million new malware strains emerging daily, identifying them before they harm computers or networks is one of the most pressing challenges in cyber security. Over the last 20 years, hackers have continuously discovered new forms of attack, giving rise to numerous malware types. Some hackers have utilized macros within Microsoft Office documents, while others have exploited browser vulnerabilities <italic>via</italic> malicious JavaScript code. The breadth of these attacks implies the need for novel automated technology to address them. The Portable Document Format (PDF) is a popular document file format. Largely unbeknownst to users, the PDF has become a significant attack vector (AV) for malware operators. Each year, many vulnerabilities are revealed in Adobe Reader, the most widely used software for reading PDF files [<xref ref-type="bibr" rid="ref-1">1</xref>], enabling hackers to commandeer targeted computers. There are three primary forms of PDF malware: exploits, phishing, and misuse of PDF capabilities. Exploits work by exploiting bugs in the API of the PDF reader application, allowing hackers to run code on an attacked computer. This is typically achieved using JavaScript code embedded within files. In phishing attacks, the PDF file itself is harmless, but it requests that users click on a link that exposes them to an attack. Campaigns of this type have recently been uncovered [<xref ref-type="bibr" rid="ref-2">2</xref>] and are considerably more challenging to recognize because of their format. Misusing PDF capabilities entails exploiting PDF file functionality, <italic>e.g</italic>., running commands or launching files. 
Each of these attack types may result in severe outcomes, <italic>e.g</italic>., hackers stealing website credentials or persuading victims to download malicious executables. Although researchers have recently begun using machine learning to detect malware, antivirus software manufacturers have primarily focused on detecting malicious PDFs using handwritten signatures. This approach demands a considerable investment in human resources and is typically weak at recognizing novel variants and zero-day exploits [<xref ref-type="bibr" rid="ref-3">3</xref>]. An alternative and widely used approach is to perform dynamic analysis by executing files within controlled sandbox environments [<xref ref-type="bibr" rid="ref-4">4</xref>], making the detection of new malware far more likely. However, it also takes considerably longer and requires a sandbox virtual machine. Furthermore, such techniques still need human intervention to define detection rules based on the behavior of the files.</p>
<p>Consequently, feature engineering improvements in the design of malicious PDF classifiers are challenging but have a substantial impact in the domain. Several approaches have been proposed to improve the robustness of classification algorithms and reduce the evasion rate of malicious files. However, new attack techniques have been quickly developed to evade these approaches. Therefore, we propose several models for enhancing robustness in the detection of malicious PDF files.</p>
<p>In this study, two different classification models are trained to detect malicious and benign PDF files. The dataset utilized in this study was obtained from the VirusTotal and Contagio platforms. In previous research [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>], the data used for training the algorithms was approximately one-third or less of the size of the data used in this study. Using small datasets decreases the generalizability of the trained model, which can underfit the data and yield ambiguous results. Furthermore, the algorithm designed by He et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] achieved an accuracy of more than 98%, but with imbalanced data having a higher proportion of malicious than benign samples.</p>
<p>Consequently, it fails to deliver the same detection performance with a balanced dataset. Falah et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] incorporated a balanced dataset containing approximately 1,000 PDF files. However, the performance of that model is below 98%. To detect malicious PDF files, we incorporated a new regularization method based on the standard deviation of the weight matrix in a convolutional neural network (CNN), and an ensemble model based on a support vector machine (SVM) was trained for the same purpose. We used more than 50% of the sourced data for training and testing, which increased the detection performance and robustness of the model. To avoid ambiguity and discrepancy in the simulation results, the datasets used for training and evaluating the machine learning models are balanced. Finally, both models perform well, but the proposed CNN-based model produces higher performance measures than state-of-the-art models.</p>
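The idea of a standard-deviation-based weight penalty can be sketched as follows. This is an illustrative interpretation only: the exact penalty form and coefficient used in the proposed model are not specified at this point in the text, so the function names, the per-layer summation, and the value of the coefficient `lam` are assumptions.

```python
import math

def std_regularizer(weights, lam=0.01):
    """Illustrative penalty: lambda times the (population) standard
    deviation of a layer's flattened weights, discouraging widely
    dispersed weight values."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    return lam * math.sqrt(var)

def total_loss(data_loss, layer_weights, lam=0.01):
    # Add the regularization term of every layer to the task loss.
    return data_loss + sum(std_regularizer(w, lam) for w in layer_weights)
```

In a deep learning framework, this term would simply be added to the cross-entropy loss before backpropagation, in the same way an L2 penalty is.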
<p>The rest of this paper is organized as follows: In Section 2, we briefly discuss PDF file structure and evasion techniques. In Section 3, we present related research. The dataset used is presented in Section 4, and the methodology is presented in Section 5. We discuss the regularizer used in this study in Section 6 and introduce the system model in Section 7. An extensive analysis of the results is presented in Section 8, a comparison to other models in Section 9, and in Section 10, we explain the limitations of the study. Finally, we present the conclusions and future research directions in Section 11.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Background</title>
<sec id="s2_1">
<label>2.1</label>
<title>PDF Structure</title>
<p>The PDF comprises four sections: header, body, cross-reference table, and trailer. Each section provides different information about the file. For example, the file format and version are identified in the header section by a magic number. The body section comprises different objects (<?A3B2 "fig1",5,"anchor"?><xref ref-type="fig" rid="fig-1">Fig. 1</xref>); the object types include but are not limited to arrays, dictionaries, and name trees. The object type and its content can significantly facilitate the classification of the objects. For instance, JavaScript code is contained within a JS or JavaScript object, as required by the PDF standards. Attackers repeatedly target this object maliciously, and previous conceptual methods detected these attacks by finding the JS and JavaScript objects. However, the new trend adopted by attackers is to hide JavaScript in indirect objects of various types.</p>
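The structure described above can be probed with a few lines of code. The sketch below is a simplification under stated assumptions: it handles only a well-formed, uncompressed PDF, and a production parser must also deal with object streams, filters, encryption, and incremental updates. The function and variable names are illustrative, not from the paper.

```python
import re

def pdf_overview(data: bytes):
    """Locate the header version and top-level indirect objects."""
    header = re.match(rb"%PDF-(\d\.\d)", data)  # magic number in the header
    version = header.group(1).decode() if header else None
    # Indirect objects look like "12 0 obj ... endobj".
    objects = re.findall(rb"(\d+) (\d+) obj(.*?)endobj", data, re.S)
    has_js = any(b"/JavaScript" in body or b"/JS" in body
                 for _, _, body in objects)
    return {"version": version, "objects": len(objects), "javascript": has_js}

# A tiny hand-made example file for demonstration.
sample = (b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
          b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n%%EOF")
```

Note that this naive scan already illustrates the evasion problem: JavaScript hidden in a compressed stream or an indirect object of another type would not match either tag.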
<p>Further, the new trend of hiding malicious code requires the classifier to scan all objects, as was done in the study by Raff et al. [<xref ref-type="bibr" rid="ref-7">7</xref>]. The direct application of n-gram analysis to the entire file disregards the fact that the object content for each object type differs significantly, resulting in a lack of robustness. The alternative is to include more semantic information in classification models trained at a higher granularity level. Training an anomaly classifier essentially achieves this because there are considerable differences between the contents of objects [<xref ref-type="bibr" rid="ref-7">7</xref>].</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Tree structure and content of a PDF file</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_18260-fig-1.png"/>
</fig>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Classical Antivirus Models for PDF Files</title>
<p>Manufacturers of antivirus software employ several methods for detecting PDF malware. Signature-based detection is the simplest and most frequently used method for identifying malicious files [<xref ref-type="bibr" rid="ref-8">8</xref>]. In this method, a security analyst manually inspects malicious files, extracting one or more patterns from the byte code (the <italic>signatures</italic>), which are stored in a database. When faced with new files, the scanner checks whether any code in the new files matches signatures in the database. If there is a match, the file is blocked. Another essential way of detecting malware is static analysis. In this technique, heuristic rules are applied to the content of the file to assess the likelihood of malware being present. The simplest way of doing this is to search for specific keywords, such as GoTo, JavaScript, or OpenAction, which are tags that can be used to harm a computer system. If none of these tags are present, the analyst gives the file a pass [<xref ref-type="bibr" rid="ref-9">9</xref>] (though certain attackers can now insert JavaScript code without a matching JavaScript tag). Dynamic analysis is a more costly but possibly more robust means of identifying malware. This technique requires files to be run within a controlled environment (sandbox), in which API calls are evaluated and retrieved while the network is checked for activity created by potential malware. Programs can then apply a heuristic approach using activity logs, <italic>e.g</italic>., flagging processes that connect to malicious websites [<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
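A minimal static-analysis heuristic of the kind described above can be sketched as follows. The tag list and threshold are illustrative stand-ins, not taken from any specific antivirus product, and, as noted in the text, obfuscation can hide tags from a raw-byte scan.

```python
# Tags commonly abused by PDF malware (illustrative selection).
SUSPICIOUS_TAGS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/GoTo", b"/Launch")

def static_score(data: bytes) -> int:
    """Count how many distinct suspicious tags appear at least once."""
    return sum(1 for tag in SUSPICIOUS_TAGS if tag in data)

def looks_malicious(data: bytes, threshold: int = 2) -> bool:
    # Absence of matches is NOT proof the file is benign: attackers can
    # encode the tags or hide JavaScript in indirect objects.
    return static_score(data) >= threshold
```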
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>PDF Malware Classifiers and Evading Techniques</title>
<p>In this subsection, we examine a pair of open-source PDF malware classifiers that have received a good deal of attention from security analysts: PDFrate [<xref ref-type="bibr" rid="ref-10">10</xref>] and Hidost [<xref ref-type="bibr" rid="ref-11">11</xref>].
<list list-type="bullet">
<list-item><p>PDFrate [<xref ref-type="bibr" rid="ref-8">8</xref>] employs 202 features, including counts of multiple keywords and fields within the PDF, for example, the number of characters in the author field, the number of <italic>endobj</italic> keywords, the total number of pixels in all images, and the number of JavaScript marker occurrences. Classification is done using a random forest classifier, which is 99% accurate and returns 0.2% false positives on the Contagio malware dataset [<xref ref-type="bibr" rid="ref-12">12</xref>]. Basic manipulation of a PDF file can cause extremely significant alterations to the feature values in PDFrate. One example would be adding pages from untampered documents to PDF malware, increasing the page count feature to the maximal integer value. This influences numerous other counts.</p></list-item>
<list-item><p>Hidost [<xref ref-type="bibr" rid="ref-11">11</xref>] employs bag-of-path features harvested from the parsed tree structure of the PDF. It finds the shortest structural path to every object, including the terminals and non-terminals within the tree, employing binary counts of the paths as features. Only paths appearing in a minimum of 1,000 files within the corpus are considered, reducing the number of paths from 9,000,000 to 6,087. Evaluation of Hidost has been done using an SVM model and a decision tree model. These models both boast 99.8% accuracy, with under 0.06% false positives returned. The binary bag-of-path features allow the classifier to detect whether specific attack properties are present in the input.</p></list-item>
</list></p>
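The bag-of-path idea can be sketched in a few lines. This is an interpretation of the scheme described above, not the reference Hidost implementation: a nested dictionary stands in for a parsed PDF object tree, and the path vocabulary is assumed to have already been filtered by the 1,000-file frequency rule.

```python
def extract_paths(tree, prefix=""):
    """Collect structural paths from the root of a nested dict
    (a stand-in for a parsed PDF object tree) to each leaf."""
    paths = set()
    for key, value in tree.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            paths |= extract_paths(value, path)
        else:
            paths.add(path)
    return paths

def binary_features(tree, vocabulary):
    """Binary feature vector: 1 if the vocabulary path occurs in the file."""
    present = extract_paths(tree)
    return [1 if p in present else 0 for p in vocabulary]

# A toy object tree for demonstration.
example = {"Root": {"Pages": {"Count": 2}, "OpenAction": {"JS": "code"}}}
```

The binary encoding is what makes the representation easy to reason about and also what an attacker must manipulate to evade the classifier.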
<p>Nevertheless, quantifying the relationship between class labels and selected features that correlate highly with the labels used in the training process is preferred. It is easy to satisfy this preference in standard classification tasks that are not subject to malware attacks: one can simply employ highly correlated features to establish highly accurate classification models. The problem is that certain features have weak causal relationships with class labels. Within this context, the features selected by Hidost and PDFrate frequently have high correlations with class labels but only weak causal relationships with them. Manual analysis has shown that such features do not necessarily have a causal relationship with the degree of maliciousness of a PDF file. Consequently, accuracy rates fall swiftly when attacks occur, despite such techniques boasting accuracy rates of over 99%. In contrast, features like shellcode occurrences, heap spray, and JavaScript obfuscation have robust causal relationships with the degree of maliciousness of the sample, as required for functional implementation. It can be problematic to directly discover features that possess robust causal relationships with class labels, but removing features with weak relationships is straightforward. In practical terms, causal feature selection requires a comprehensive analysis of the samples [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>].</p>
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Evading Malware Classifiers Automatically</title>
<p>Several automated attacks have found ways to successfully evade PDF malware classifiers using various threat models.
<list list-type="bullet">
<list-item><p>In white-box attacks, it is generally assumed that the attackers have complete knowledge of the system. Thus, they can directly attack the precise model under training, <italic>e.g</italic>., gradient-based attacks [<xref ref-type="bibr" rid="ref-14">14</xref>,<xref ref-type="bibr" rid="ref-15">15</xref>]. One example in a white-box environment is the Gradient Descent and Kernel Density Estimation (GD-KDE) attack, which targets the SVM version of PDFrate [<xref ref-type="bibr" rid="ref-15">15</xref>]. Furthermore, Grosse et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] employed an approach that simply adds features to preserve the extant malicious functions of the adversarial malware example [<xref ref-type="bibr" rid="ref-17">17</xref>]. A problem with white-box gradient-based attacks is that instances of evasion are revealed only within the feature space, such that the PDF malware is not actually generated.</p></list-item>
<list-item><p>In a black-box attack, it is generally assumed that the attackers have no access to the parameters of the model. However, they do have oracle access to prediction labels for certain samples and access to prediction confidence in certain instances. In specific environments, the assumption is also made that the model type and features are known. Xu et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] employed genetic evolution algorithms for the automatic evasion of Hidost and PDFrate. This evolutionary algorithm employs fitness scores for feedback, guiding the search for evasive PDF variants through mutations of seed PDF malware. For each generation of the population in the search, the attack employs Cuckoo sandbox oracles to make dynamic checks confirming whether mutated PDFs retain their malicious functions. Such checks are far more robust than the static insertion-only techniques employed in gradient-based attacks. Dang et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] employed a more constrained threat model that does not give attackers access to classification scores, only allowing access to classified labels and black-box morphers for PDF manipulation. They employed a hill-climbing scoring function to attack the classifier under these assumptions.</p></list-item>
</list></p>
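The evolutionary search described in the bullet above can be sketched as a loop. This skeleton is only an interpretation of the general strategy: `mutate`, `still_malicious` (the sandbox oracle), and `classifier` are hypothetical stand-ins, since a real attack needs a PDF mutation engine and a dynamic-analysis sandbox.

```python
import random

def evolve(seed, mutate, still_malicious, classifier,
           pop_size=10, generations=5, rng=random.Random(0)):
    """Mutate a seed sample, keep variants the oracle confirms are still
    malicious, and stop when the target classifier labels one benign."""
    population = [seed]
    for _ in range(generations):
        variants = [mutate(rng.choice(population), rng)
                    for _ in range(pop_size)]
        # Dynamic check: discard mutants that lost their payload.
        population = [v for v in variants if still_malicious(v)] or population
        for v in population:
            if classifier(v) == "benign":   # evasion succeeded
                return v
    return None
```

The key design point, as the text notes, is that fitness feedback plus a dynamic oracle makes the search far more reliable than static insertion-only manipulation.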
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Related Research</title>
<p>The signature-based detection method was long the standard in cybersecurity, and for the average researcher, it was the method of choice for spotting malicious PDFs [<xref ref-type="bibr" rid="ref-20">20</xref>]. However, there are now more obstacles due to the rapid rise in threats, the increasing effort required for handwritten rules, and the recent pervasiveness of machine learning detection capabilities. Cross et al. [<xref ref-type="bibr" rid="ref-21">21</xref>] performed static and dynamic analysis of PDF files and proposed a specific set of characteristics for exploring potentially harmful actions by PDF files. Those characteristics include launching a program, responding to a user action, running JavaScript, and describing the file format <italic>via</italic> the presence of an xref table or the number of pages. Cross et al. trained a decision tree classifier on a small dataset of 89 malicious and 2,677 benign samples. Thus, they detected approximately 60% of the malware with a precision of 80% on 5-fold cross-validation. Smutz et al. [<xref ref-type="bibr" rid="ref-22">22</xref>] researched the metadata and structure of documents using the feature extraction method. The intended structural features are the number of specific indicative strings (/JS, /Font) or the position of some objects in the file. The higher-level characteristics of the file indicate the metadata features, <italic>e.g</italic>., when the unique identifier (pdfid0) of the PDF is not a match. Smutz et al. built a random forest classifier with 10,000 files as the training set, 100,000 files (297 malware) as the operational dataset, and 202 manually selected features. Running the training set through 10-fold cross-validation, the reported accuracy exceeded 99% without any false positives. Furthermore, they performed malware classification with a false positive rate (FPR) of only 0.2% on that test set. However, these results may reflect severe overfitting of the model. 
A downside of running 10-fold cross-validation with random sampling is that it increases the probability of the same files appearing at each iteration in both the training and test sets, which causes overfitting. Additionally, counting the /Font string as a significant feature of the model may facilitate matching files because of a rise in their variance. Tzermias et al. [<xref ref-type="bibr" rid="ref-23">23</xref>] utilized static analysis features, simulating the JavaScript code in the file and detecting 89% of malware in their test dataset. While this approach is robust against obfuscation, it takes 1.5 s to run the algorithm on a single PDF file, and it requires a virtual machine (VM). Zhang [<xref ref-type="bibr" rid="ref-24">24</xref>] recently developed a detection method superior to eight antivirus products on the market using a large dataset containing more than 100,000 files with 13,000 malware samples. His detection method entails running a multi-layer perceptron (MLP) on 48 manually selected features, and the achieved detection rate is approximately 95%, with a 0.1% FPR. Usually, a PDF reader requires a predefined set of keywords (tags) in the file content to render services, such as displaying links, opening images, and executing actions. Alternatively, Maiorca et al. [<xref ref-type="bibr" rid="ref-25">25</xref>] produced features using keywords. They specify how to select the incorporated features in three basic steps: First, split the dataset into benign and malicious samples. Second, use a K-Means clustering algorithm to divide the tags into frequent and infrequent sets. Lastly, merge the frequent tags of both classes and use them as features. When they applied these steps to a dataset of 12,000 files, they generated 168 features. Using a test set of 9,000 files, they trained a random forest classifier and achieved a detection rate of 99.55% with a 0.2% FPR. These results are excellent and similar to those of previous studies. 
Still, there is a high risk of overfitting from the random splitting of the data because the test malware is not necessarily newer than the training data. This is commonplace when malicious files use JavaScript or ActionScript exploits. Thus, we can determine the outcome success rate by detecting those two tags. Similar to Extensible Markup Language (XML), PDF displays a hierarchical structure. &#x0160;rndi&#x0107; et al. [<xref ref-type="bibr" rid="ref-26">26</xref>] proposed a new method based on an open-source parser to recover the file&#x0027;s tree-like structure. Their method constructs each feature by concatenating all the tags along the path from the root to a tree leaf rather than using a single tag as the feature. Subsequently, a decision tree and an SVM model were trained using only the paths observed more than 1,000 times in the dataset. The evaluation step was performed using a large dataset of 660,000 files, including 120,000 malware samples obtained largely from the VirusTotal platform. The process of obtaining data involved conducting several experiments over different periods. The researchers used over four weeks of VirusTotal data to train their algorithm during that study and repeated the experiment six times for evaluation. Overall, they obtained a detection rate of 87% with a 0.1% FPR. Those approaches require prior domain knowledge due to their reliance on manual feature selection or the parsing of the PDF structure. Recently, Jeong et al. [<xref ref-type="bibr" rid="ref-27">27</xref>] applied a CNN directly to the byte code to detect malicious streams inside PDF files. Their methodology is based on utilizing the embedding layer at the top of their neural network to transform the first 1,000 bytes of a stream into vectors. Next, they used more standard machine learning methods to train multiple networks and then compared them, resulting in a detection rate of 97% and a precision of 99.7%. 
However, this experiment raises several concerns. First, Jeong et al. limited the search by the convolutional filters to focus on specific JavaScript strings, and they used a small dataset of 1,978 streams, all of whose malicious streams contained embedded JavaScript.</p>
<p>Additionally, instead of comparing the networks with tag-level feature extraction, as described previously, their evaluation only compares the networks with other machine learning models at the byte level. Over the last few years, there have been many attempts to apply deep learning to malware detection, especially for executable (EXE) files. David et al. [<xref ref-type="bibr" rid="ref-28">28</xref>] exploited the behavior of executables to automatically produce signatures using a deep belief network with denoising autoencoders. They extracted API calls by running the files in a sandbox and produced 5,000 one-hot encoded features from them. The resulting output of the classifier network was signatures containing 30 features, achieving 98.6% accuracy on the dataset used. Pascanu et al. [<xref ref-type="bibr" rid="ref-29">29</xref>] advanced the use of deep learning on API calls. They used an echo state network and a recurrent neural network to embed the malware behavior. In the network training phase, they predicted the next API call and derived the features of a classifier from the last hidden state. They achieved a detection rate of 98.3% with a 0.1% FPR. Saxe et al. [<xref ref-type="bibr" rid="ref-30">30</xref>] used static analysis of executable files to manually extract features. This analysis is performed using a four-layer perceptron model and detects 95% of the malicious files in their dataset with a 0.1% FPR.</p>
<p>Regarding applications that use CNNs, Raff et al. [<xref ref-type="bibr" rid="ref-31">31</xref>] utilized a CNN to detect malicious executables from raw bytes. Their methodology converts the first bytes of the file into a matrix through an embedding layer, and this matrix serves as the input to the CNN. They obtained their dataset from two different sources, and it contains more than 500,000 files. Raff et al. achieved 90% balanced accuracy. While this outcome is clearly inferior to that of the other approaches, their approach does not require preprocessing of the data and yields efficient predictions. Accuracy is enhanced by 1.2% with the inclusion of more training files (two million).</p>
<p>Based on this literature review, there are various proposed methods for malicious PDF detection, including machine learning methods, and these approaches achieve satisfactory results with traditional datasets. However, as long as there is a continued evolution of the machine learning models used by attackers and defenders, it is certain that new and more advanced adversarial strains will be produced that evade existing detectors. Therefore, the creation of stable detection models with robust classification efficiency is an open challenge.</p>
</sec>
<sec id="s4">
<label>4</label>
<title>Dataset</title>
<p>A malicious PDF document primarily uses JavaScript for cyberattacks; therefore, every malicious PDF document contains JavaScript code in some form. In contrast, JavaScript is seldom found in legitimate documents. Nevertheless, relying on JavaScript alone as the basis for detection could lead to overfitting because some legitimate documents do contain JavaScript. According to Falah et al. [<xref ref-type="bibr" rid="ref-6">6</xref>], the dataset includes various types of attacks, including PowerShell downloaders, URL downloaders, executable malware, and shellcode downloaders. To encompass all of these attack classes, we obtained all modern malicious documents provided by the VirusTotal and Contagio datasets, as discussed subsequently.</p>
<p>In this study, the dataset used was obtained from two different sources:</p>
<p>1. VirusTotal: There are 10,603 malicious files sourced from this platform, obtained in December 2017.</p>
<p>2. Contagio [<xref ref-type="bibr" rid="ref-32">32</xref>]: There are a total of approximately 20,000 malicious and clean PDF files sourced from this platform, obtained in November 2017.</p>
<p>More than 30,000 files were used in this study for the training of different machine learning models. The distribution of benign and malicious files is presented in <?A3B2 "tbl1",5,"anchor"?><xref ref-type="table" rid="table-1">Tab. 1</xref>.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Dataset obtained from different sources</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Benign</th>
<th>Malicious</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>VirusTotal</td>
<td>0</td>
<td>10,603</td>
<td>10,603</td>
</tr>
<tr>
<td>Contagio</td>
<td>9,087</td>
<td>11,107</td>
<td>20,194</td>
</tr>
<tr>
<td>Total</td>
<td>9,087</td>
<td>21,710</td>
<td>30,797</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s5">
<label>5</label>
<title>Methodology</title>
<sec id="s5_1">
<label>5.1</label>
<title>Feature Extraction from Object Content Based on Encoding and N-Gram</title>
<p>The n-gram analysis cannot be applied directly for our purposes because it inflates the feature space, making feature selection difficult. Furthermore, it makes the features overly sensitive, undermining the stability and robustness of the model and paving the way for evasion through simple code obfuscation. Such issues are overcome <italic>via</italic> object content replacement based on the object type. The replacement rules depend on the classes of neighboring characters rather than on specific characters. For example, a JavaScript statement such as Doc &#x003D; DocumentApp.openById (&#x201C;&#x003C;my-id&#x003E;&#x201D;) can be replaced with ABBGABBBBBBABBHBBBABABDLGBBJBBGLD. After this encoding, the number of unique elements is reduced to fewer than 30 (<?A3B2 "tbl2",5,"anchor"?><xref ref-type="table" rid="table-2">Tab. 2</xref>). The resulting features are less sensitive to modifications introduced by code obfuscation and encoding. Not only does this improve robustness, but it also makes it easier to reduce the feature dimensions.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Object content replacement rules applied to the objects present in PDF files</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Type</th>
<th>Instance</th>
<th>Replace with</th>
</tr>
</thead>
<tbody>
<tr>
<td>Whitespace</td>
<td><inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi mathvariant="normal">&#x2216;</mml:mi></mml:math></inline-formula>n <inline-formula id="ieqn-1a"><mml:math id="mml-ieqn-1a"><mml:mi mathvariant="normal">&#x2216;</mml:mi></mml:math></inline-formula>t <inline-formula id="ieqn-1b"><mml:math id="mml-ieqn-1b"><mml:mi mathvariant="normal">&#x2216;</mml:mi></mml:math></inline-formula>r</td>
<td>None</td>
</tr>
<tr>
<td>Uppercase</td>
<td>A&#x2013;Z</td>
<td>A</td>
</tr>
<tr>
<td>Lowercase</td>
<td>a&#x2013;z</td>
<td>B</td>
</tr>
<tr>
<td>Digit</td>
<td>0&#x2013;9</td>
<td>C</td>
</tr>
<tr>
<td>Parentheses</td>
<td>()</td>
<td>D</td>
</tr>
<tr>
<td>Brackets</td>
<td>[]</td>
<td>E</td>
</tr>
<tr>
<td>Braces</td>
<td>{}</td>
<td>F</td>
</tr>
<tr>
<td>Comparison operator</td>
<td>&#x003E; / &#x003C; / &#x003C;&#x003D; / &#x003E;&#x003D; / &#x003D;&#x003D;</td>
<td>G</td>
</tr>
<tr>
<td>Separator</td>
<td>, / . / : / ;</td>
<td>H</td>
</tr>
<tr>
<td>Keywords</td>
<td>if/else/while/for/ &#x2026;</td>
<td>I</td>
</tr>
<tr>
<td>Operator</td>
<td>&#x002B; / &#x2212; / &#x002B;&#x003D; / &#x2212;&#x003D;/ &#x003D; / &#x2026;</td>
<td>J</td>
</tr>
<tr>
<td>Logical operator</td>
<td>&#x0026;&#x0026; / || / and / or</td>
<td>K</td>
</tr>
<tr>
<td>Quotation</td>
<td>&#x2018; / &#x201C;&#x201D;</td>
<td>L</td>
</tr>
</tbody>
</table>
</table-wrap>
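<p>As an illustration, the replacement scheme of Tab. 2 can be sketched in Python. The keyword list, the operator set, and the rule precedence below are assumptions (and the keyword/operator matching is naive, ignoring word boundaries); the paper does not publish its implementation.</p>

```python
# Illustrative sketch of the Tab. 2 object-content encoding.
# KEYWORDS and MULTI are assumed subsets, not the authors' exact lists.
KEYWORDS = ("while", "else", "for", "if")
MULTI = ((">=", "G"), ("<=", "G"), ("==", "G"),   # comparison operators
         ("+=", "J"), ("-=", "J"),                # compound operators
         ("&&", "K"), ("||", "K"), ("and", "K"), ("or", "K"))

def encode(content: str) -> str:
    """Replace object content with single-letter class tokens (Tab. 2)."""
    out, i = [], 0
    while i < len(content):
        for kw in KEYWORDS:                       # keywords -> I
            if content.startswith(kw, i):
                out.append("I")
                i += len(kw)
                break
        else:
            for op, tok in MULTI:                 # multi-character operators
                if content.startswith(op, i):
                    out.append(tok)
                    i += len(op)
                    break
            else:
                ch = content[i]
                if ch in " \t\n\r":
                    pass                          # whitespace is dropped
                elif ch.isupper():
                    out.append("A")
                elif ch.islower():
                    out.append("B")
                elif ch.isdigit():
                    out.append("C")
                elif ch in "()":
                    out.append("D")
                elif ch in "[]":
                    out.append("E")
                elif ch in "{}":
                    out.append("F")
                elif ch in "<>":
                    out.append("G")
                elif ch in ",.:;":
                    out.append("H")
                elif ch in "+-=":
                    out.append("J")
                elif ch in "'\"":
                    out.append("L")
                else:
                    out.append(ch)                # keep unmapped characters
                i += 1
    return "".join(out)
```

<p>For instance, <monospace>encode('if (x == 1)')</monospace> yields <monospace>IDBGCD</monospace> under these assumed rules.</p>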
<p>Similar to the study by He et al. [<xref ref-type="bibr" rid="ref-5">5</xref>], the feature extraction method is applied to a benign dataset, and abnormality detection is then performed. The steps of feature extraction based on encoding and the n-gram are as follows:</p>
<p>1. PDF files are parsed and extracted.</p>
<p>2. Depending on their type, objects are added to different datasets.</p>
<p>3. The respective objects are decoded and decrypted as string content based on the PDF format specifications for each dataset.</p>
<p>4. Encoding is performed (<xref ref-type="table" rid="table-2">Tab. 2</xref> [<xref ref-type="bibr" rid="ref-5">5</xref>]), and the feature set is generated using the n-gram approach.</p>
<p>5. Frequencies are computed for feature occurrence and passed through a threshold to filter out features with fewer occurrences and negligible effects.</p>
<p>6. The final feature dataset is created by combining all the features extracted in the preceding steps and is fed into the machine learning model.</p>
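<p>Steps 4 and 5 above, namely n-gram generation over the encoded object strings followed by frequency thresholding, can be sketched as follows; the n-gram length and threshold are illustrative values, not ones reported in the paper.</p>

```python
from collections import Counter

def ngram_features(encoded_strings, n=3, min_count=2):
    """Step 4: slide an n-character window over each encoded string;
    Step 5: drop n-grams occurring fewer than min_count times.
    Both parameter values are illustrative assumptions."""
    counts = Counter()
    for s in encoded_strings:
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    # Keep only n-grams that pass the occurrence threshold.
    return {gram: c for gram, c in counts.items() if c >= min_count}
```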
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Feature Extraction Based on the File Structure</title>
<p>A PDF file can be represented by a tree-like structure, with each node representing a different object in the file. In this study, we incorporated both the horizontal and vertical relationships among the objects mentioned by He et al. [<xref ref-type="bibr" rid="ref-5">5</xref>]. Our task is to classify malicious and benign PDF files based on the information present in these files; therefore, the objects in a PDF file were organized in an adjacency matrix that captures the file structure, also called a structure matrix. The process of extracting features based on the file structure is given below:</p>
<p>1. PDF files are parsed, and different objects are extracted.</p>
<p>2. Low-frequency types are filtered out.</p>
<p>3. Both benign and malicious types are combined.</p>
<p>4. The feature matrix and feature values are generated by combining objects with similar names based on functionality.</p>
<p>Further, the same process described by He et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] is applied. The features extracted using object encoding and the n-gram approach were used to train a k-means clustering model for each object type to detect abnormal objects. In parallel, the structure matrices were computed and further extended by adding new dimensions representing the intermediate results for the respective types. These extended structure matrices were used to train the CNN model embedded with the new regularization, as explained in the following sections.</p>
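<p>The structure matrix described above can be sketched as an adjacency matrix over object types. The type names and reference edges in this example are illustrative; the paper's exact type-merging rules are not shown.</p>

```python
def structure_matrix(edges, types):
    """Build the adjacency ("structure") matrix over PDF object types.

    edges lists (parent_type, child_type) references extracted from the
    PDF object tree; types is the filtered type vocabulary. This is a
    sketch -- the exact type merging and matrix size are assumptions."""
    index = {t: i for i, t in enumerate(types)}
    m = [[0] * len(types) for _ in types]
    for parent, child in edges:
        if parent in index and child in index:
            m[index[parent]][index[child]] += 1  # count parent -> child refs
    return m
```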
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Control Complexity</title>
<p>The primary consideration in machine learning is creating an algorithm that works efficiently both with training data and with new data. The &#x201C;no free lunch&#x201D; theorem suggests that every individual task requires a customized machine learning algorithm design. Learning machines incorporate collections of preferences and strategies suited to the problem they are addressing. Such preferences and strategies, aimed at the central objective of improving generalization, are together referred to as regularization. In deep learning, various regularization methodologies are now accessible owing to the substantial number of parameters involved. Much research in recent years has been devoted to developing regularization strategies that work more effectively.</p>
<p>With machine learning, data sample points can be divided into two components: a pattern and stochastic noise. All machine learning algorithms must have the capability to model the pattern while ignoring the noise. Machine learning algorithms that stress finding a fit for the noise and other peculiarities will tend toward overfitting along with the pattern. In such instances, regularization can help select a suitable level of model complexity for improving the algorithm&#x2019;s predictions.</p>
<p>In general, there are dangers of overfitting with complex models that have no training errors and high testing errors. On occasion, it is preferable to have robust false assumptions rather than weak correct assumptions because weak correct assumptions require additional data to prevent overfitting. Overfitting can occur in several ways and is not always easily identifiable. One means of understanding overfitting is the decomposition of generalization errors into variance and bias.</p>
<sec id="s6_1">
<label>6.1</label>
<title>Lasso Regression (L1) Regularization</title>
<p>An L1 regularizer [<xref ref-type="bibr" rid="ref-33">33</xref>] imposes penalties on the absolute values of the weight matrix to prevent them from growing large. Its primary advantage is that it drives the weights of less significant features, which are not required to define the classifier boundaries, to zero, thereby performing feature selection. <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref> shows that the absolute size of the regression coefficients is penalized. The L1 regularizer can also reduce variability and improve the accuracy of a linear regression model. This regularizer is expressed mathematically as follows</p>
<p><disp-formula id="eqn-1">
<label>(1)</label>
<mml:math id="mml-eqn-1" display="block"><mml:mi>&#x03BB;</mml:mi><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math>
</disp-formula></p>
<p>where n is the number of features within the dataset, <inline-formula id="ieqn-1c"><mml:math id="mml-ieqn-1c"><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the corresponding weight value for each feature, and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> represents the regularization penalty.</p>
<p>However, if the number of predictors exceeds the number of observations, the L1 regularizer will select at most n predictors as non-zero, even when every predictor is relevant (or can be employed in the test set). In such instances, the L1 regularizer may have difficulty coping with this form of data. When two or more highly collinear variables are present, L1-regularized regression will choose one of them at random, which is unsuitable for data interpretation.</p>
</sec>
<sec id="s6_2">
<label>6.2</label>
<title>Ridge Regression (L2) Regularization</title>
<p>An L2 regularizer [<xref ref-type="bibr" rid="ref-33">33</xref>] modifies the residual sum of squares (RSS) <italic>via</italic> the addition of a penalty equivalent to the square of the magnitude of the coefficients. This technique is considered suitable chiefly when there is multicollinearity (high correlation between independent variables) in the data. With multicollinearity, although the ordinary least squares (OLS) estimates are unbiased, they have a large variance, which shifts the observed values considerably away from the true values. Adding an element of bias to the regression estimates yields ridge regression, which reduces the standard errors. This generally resolves multicollinearity problems <italic>via</italic> the shrinkage parameter <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula>. In contrast to L1, it is not necessary to reduce the weights of insignificant features to zero; instead, the corresponding coefficient values are shrunk but remain above zero. In such cases, the squared magnitudes of the entries of the weight matrix are penalized, which is referred to as the L2 regularizer. <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> shows this regularizer in mathematical notation. The <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> parameter imposes additional penalties on the corresponding weight values. Because it controls the magnitude of the coefficient values, it is essential that a good <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> value is selected.</p>
<p><disp-formula id="eqn-2">
<label>(2)</label>
<mml:math id="mml-eqn-2" display="block"><mml:mi>&#x03BB;</mml:mi><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math>
</disp-formula></p>
<p>where <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> is the regularization penalty, n is the number of features, and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>&#x03C9;</mml:mi></mml:math></inline-formula> is the coefficient value of each feature.</p>
<p>The L1 regularizer differs from the L2 regularizer in that it uses absolute values rather than squared values in the penalty function. Penalizing the sum of the absolute values (or imposing an equivalent constraint on it) results in certain parameter estimates being exactly zero. The higher the applied penalty, the further the estimates shrink toward zero. This facilitates variable selection from a given set of n variables.</p>
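<p>Eqs. (1) and (2) can be written directly as penalty functions; the following plain-Python sketch uses an illustrative weight vector and &#x03BB; value.</p>

```python
def l1_penalty(weights, lam):
    """Eq. (1): lambda times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Eq. (2): lambda times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)
```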
</sec>
<sec id="s6_3">
<label>6.3</label>
<title>New Regularization</title>
<p>In the machine learning discipline, the most frequently employed regularization methods are the L1-norm and L2-norm. In the optimization discipline, such regularizers assess weight complexity to create a more general mapping in the network. The L1-norm imposes a penalty on the sum of absolute values, while the L2-norm imposes a penalty on the sum of squared values. The new regularization form takes the standard deviation of the weight matrix and multiplies it by <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> to create the new regularizer term. Thus, the new regularizer incorporates the standard deviation of the weights into the loss function. Having examined the L1 and L2 regularizers, we identified a significant problem: the individual weight values are regulated without considering the correlation among the entries of the weight matrix. To resolve this problem, the new regularizer employs the standard deviation to derive the regularization term, constructing an adaptive form of weight decay. Thus, the regularizer does not permit a wide range of values in the weight space of the learning model. <xref ref-type="disp-formula" rid="eqn-3">Eqs. (3)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-5">(5)</xref> express the mathematical formula of this new regularizer.</p>
<p><disp-formula id="eqn-3">
<label>(3)</label>
<mml:math id="mml-eqn-3" display="block"><mml:mi>&#x03BB;</mml:mi><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math>
</disp-formula></p>
<p>where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> represents the regularization penalty, which limits the weight matrix from growing and dispersing; k represents the number of filters in a convolutional layer; i indexes the rows of the weight matrix; and <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> represents the standard deviation of the weight values.</p>
<p>Next, the mathematical formula for calculating <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is given as</p>
<p><disp-formula id="eqn-4">
<label>(4)</label>
<mml:math id="mml-eqn-4" display="block"><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03C9;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msqrt><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>n</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:mo>{</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>n</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>}</mml:mo></mml:mrow></mml:msqrt></mml:math>
</disp-formula></p>
<p>where k represents the row count, and i is the i<sup><roman>th</roman></sup> row in the weight matrix. <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> is a parameter employed in controlling the values of the weight matrix, with n representing the number of columns in the i<sup><roman>th</roman></sup> row of the weight matrix (<inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>n</mml:mi></mml:math></inline-formula> <italic>is dependent on the number of features in the dataset</italic>). Thus, n represents the size of the weight vector, and the loss function for this instance becomes</p>
<p><disp-formula id="eqn-5">
<label>(5)</label>
<mml:math id="mml-eqn-5" display="block"><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>{</mml:mo><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>y</mml:mi><mml:mo>&#x003A;</mml:mo><mml:mi>&#x03C9;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03C9;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math>
</disp-formula></p>
<p>Hence, we minimize the loss function of <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi mathvariant="bold-italic">&#x03C9;</mml:mi></mml:math></inline-formula>, using the standard deviation of <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi mathvariant="bold-italic">&#x03C9;</mml:mi></mml:math></inline-formula> to keep the values within a specified range, making a gainful trade: a significant reduction in variance without excessively increasing the bias.</p>
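<p>A minimal sketch of the new regularization term is given below, consistent with Eqs. (3) and (4): the population standard deviation is computed per filter, summed, and scaled by &#x03BB;. In training, this term would be added to the data loss as in Eq. (5).</p>

```python
import math

def sigma(weights):
    """Eq. (4): population standard deviation of the flattened weights."""
    n = len(weights)
    sq = sum(w * w for w in weights)
    s = sum(weights)
    return math.sqrt(sq / n - (s / n) ** 2)

def std_regularizer(filters, lam):
    """Eq. (3): lambda times the sum of per-filter standard deviations;
    in Eq. (5) this term is added to the data loss before minimization."""
    return lam * sum(sigma(f) for f in filters)
```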
<p><?A3B2 "fig2",5,"anchor"?><xref ref-type="fig" rid="fig-2">Fig. 2</xref> illustrates the feasible regions for the L1, L2, and new regularization techniques. We show the contour of the new regularizer, which highlights its power and effectiveness. The contours of each regularizer represent different loss values. The L2-norm behaves in a circular fashion, enclosing the L1-norm. The new regularizer, however, behaves in a parabolic fashion and extends values past the limit of the L2-norm. This is helpful because it enlarges the limit values (space) available for adoption, and the space may be expanded based on the penalty term <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula>. Thus, the penalization imposed for weight decay is insignificant in comparison to the total cost and simply facilitates moving the optimal point to a safe region. Based on these observations, we determined that this methodology is effective for implementing regularization.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Contours of the L2 (left) and L1 (middle) regularizers and the new (right) regularizer. Here, we demonstrate the coefficients &#x03B2;<sub>1</sub> and &#x03B2;<sub>2</sub> at the global minimum</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_18260-fig-2.png"/>
</fig>
</sec>
</sec>
<sec id="s7">
<label>7</label>
<title>Model Architecture and Training</title>
<sec id="s7_1">
<label>7.1</label>
<title>Convolutional Neural Network (CNN) Integrated with the New Regularization</title>
<p>The CNN model used in this study has three convolutional layers, inspired by Fettaya et al. [<xref ref-type="bibr" rid="ref-34">34</xref>]. This model is integrated with the new regularization described in the preceding sections. The first convolutional layer has a window size of eight, a stride of two, and 20 kernels. The second layer has the same window size and stride, but the number of kernels is increased to 40.</p>
<p>Similarly, the window size in the last convolutional layer is reduced to four with a stride of two, and the number of kernels is increased to 80. All these layers are immediately followed by a batch-normalization layer, as shown in <?A3B2 "fig3",5,"anchor"?><xref ref-type="fig" rid="fig-3">Fig. 3</xref>. Finally, a fully connected layer with 256 units and a classification layer are appended. Furthermore, each convolutional layer and the fully connected layer are directly followed by a rectified linear unit (ReLU) in the CNN model.</p>
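<p>The dimension flow of this architecture can be checked with a short sketch. The input length of 2,500 (the flattened structure matrix of Section 7.2) and &#x201C;valid&#x201D; (no-padding) convolutions are our assumptions, as the paper does not state them.</p>

```python
def conv_out_len(length, window, stride):
    """Output length of a 1-D convolution with no padding ("valid")."""
    return (length - window) // stride + 1

# (window, stride, kernels) per layer, from the text above.
LAYERS = [(8, 2, 20), (8, 2, 40), (4, 2, 80)]

def feature_lengths(n=2500):
    """Sequence lengths after each conv layer; the input length 2,500 and
    the padding mode are assumptions, not values stated by the authors."""
    lengths = []
    for window, stride, _ in LAYERS:
        n = conv_out_len(n, window, stride)
        lengths.append(n)
    return lengths  # final length feeds the 256-unit dense layer
```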
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Proposed CNN architecture based on new regularization</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_18260-fig-3.png"/>
</fig>
</sec>
<sec id="s7_2">
<label>7.2</label>
<title>Ensemble Model</title>
<p>In addition to the neural network model, we trained an ensemble model based on an SVM [<xref ref-type="bibr" rid="ref-35">35</xref>] with three different kernels. Kernels are a popular method in SVMs for making data more separable and distinguishable by projecting the data into higher dimensions. In the first step, the structure matrix (the values of the structure matrix) is flattened to a single dimension with 2,500 elements, which serves as the input to each SVM. The types of kernels used in an SVM are explained below.</p>
<p>&#x2022; Gaussian kernel</p>
<p>The Gaussian kernel is a popular kernel used in SVMs. It adds a <italic>bump</italic> around each data point [<xref ref-type="bibr" rid="ref-35">35</xref>]. Mathematically, it can be expressed as in <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref>.</p>
<p><disp-formula id="eqn-6">
<label>(6)</label>
<mml:math id="mml-eqn-6" display="block"><mml:mi>K</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math>
</disp-formula></p>
<p>where <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:math></inline-formula> are the two feature vectors, and the <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> value is set to 0.1. We refer to this classifier, with the Gaussian kernel integrated, as SVM-1.</p>
<p>&#x2022; Polynomial kernel</p>
<p>The polynomial kernel is commonly expressed as in <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>. It is a directional function whose dot product depends on the directions of the two vectors in the low-dimensional input space.</p>
<p><disp-formula id="eqn-7">
<label>(7)</label>
<mml:math id="mml-eqn-7" display="block"><mml:mi>K</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math>
</disp-formula></p>
<p>We considered two SVM classifiers with polynomial kernels of degree d &#x003D; 2 (SVM-2) and d &#x003D; 3 (SVM-3). The SVM-based ensemble model is illustrated in <?A3B2 "fig4",5,"anchor"?><xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
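<p>A minimal sketch of the two kernel functions and the majority-vote rule follows. The negative sign in the Gaussian exponent follows the standard convention, and the voting threshold assumes the three classifiers described above.</p>

```python
import math

def gaussian_kernel(xi, xj, sigma=0.1):
    """Eq. (6), with the conventional negative sign in the exponent;
    sigma = 0.1 as stated in the text (SVM-1)."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2 * sigma ** 2))

def polynomial_kernel(xi, xj, d):
    """Eq. (7): (xi . xj + 1)^d, with d = 2 for SVM-2 and d = 3 for SVM-3."""
    return (sum(a * b for a, b in zip(xi, xj)) + 1) ** d

def majority_vote(labels):
    """Ensemble decision over the three SVMs (1 = malicious, 0 = benign)."""
    return int(sum(labels) >= 2)
```

<p>For example, if SVM-1 and SVM-3 output 1 (malicious) and SVM-2 outputs 0 (benign), <monospace>majority_vote([1, 0, 1])</monospace> returns 1, matching the decision rule described above.</p>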
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Proposed SVM-based ensemble architecture</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_18260-fig-4.png"/>
</fig>
</sec>
<sec id="s7_3">
<label>7.3</label>
<title>Training and Testing</title>
<sec id="s7_3_1">
<label>7.3.1</label>
<title>Training and Testing of the CNN Model</title>
<p>The CNN model integrated with the proposed regularization method was trained using the features extracted <italic>via</italic> the methods in the preceding section. There are 30,797 PDF files in the dataset, and 40% of this data (sourced from both the VirusTotal and Contagio datasets), totaling 12,970 PDF files, was used for training. For testing, a disjoint 15% of the data from both categories was used. To keep the training set nearly balanced, 60% of the files taken from the Contagio dataset were benign and 40% were malicious. Hence, the numbers of benign and malicious PDF files used for training were 5,500 and 7,470, respectively. The same ratio was applied to the test data from the Contagio dataset.</p>
</sec>
<sec id="s7_3_2">
<label>7.3.2</label>
<title>Training and Testing of the Ensemble Model</title>
<p>The SVM-1, SVM-2, and SVM-3 classifiers were trained and tested in parallel on the data presented in <?A3B2 "tbl3",5,"anchor"?><xref ref-type="table" rid="table-3">Tabs. 3</xref> and <?A3B2 "tbl4",5,"anchor"?><xref ref-type="table" rid="table-4">4</xref>. The malicious-or-benign decision was made by majority vote. For example, if SVM-1 and SVM-3 classified an instance as malicious and SVM-2 classified it as benign, it was considered malicious.</p>
</sec>
</sec>
<sec id="s7_4">
<label>7.4</label>
<title>Hyper-Parameters Adjustment</title>
<p>To train the model effectively, the batch size is set to 64, and the learning rate is initially set to 2 <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 10<sup>&#x2212;1</sup> and decays exponentially every 20 epochs. The dropout rate is set to zero because the proposed regularization method is used to handle the complexity of the model. The regularization is inserted into every layer of the model to tune the kernel parameters. We recall that the numbers of kernels used are 20, 40, and 80 in the first, second, and last layers, respectively. Similarly, each layer is followed by a ReLU activation function.</p>
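<p>The learning-rate schedule described above can be sketched as follows. The decay factor is an assumed value, as the text reports only the initial rate and the 20-epoch decay interval.</p>

```python
def learning_rate(epoch, base=0.2, decay=0.5):
    """Step-wise exponential decay every 20 epochs. The base rate of
    2e-1 comes from the text; the decay factor of 0.5 is an assumption."""
    return base * decay ** (epoch // 20)
```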
<p>The experiments in this study were performed on a computer system with 64 GB of RAM, four 2.4 GHz Intel CPUs, and a GeForce GTX 1080 Ti GPU with an 11 GB frame buffer.</p>
</sec>
</sec>
<sec id="s8">
<label>8</label>
<title>Experiments and Results</title>
<sec id="s8_1">
<label>8.1</label>
<title>Evaluation Metrics</title>
<p>To evaluate the performance of our model, we computed different evaluation metrics, including accuracy, precision, recall, and F1 score. Each of these metrics is expressed in <xref ref-type="disp-formula" rid="eqn-8">Eqs. (8)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-11">(11)</xref>. In these equations, true positives (TP) are the samples correctly classified as malicious, while false positives (FP) are benign samples classified as malicious. Similarly, true negatives (TN) are samples correctly classified as benign, while false negatives (FN) are malicious samples classified as benign.</p>
<p>Accuracy is computed as the number of correct predictions divided by the total number of predictions (<xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref>). The true positive rate (TPR), or recall, is derived by dividing the number of correctly predicted malicious samples by the total number of actual malicious samples in the test data (<xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref>). Similarly, precision is derived by dividing the number of correctly predicted malicious samples by the number of all samples identified by the classifier as malicious (<xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref>). As can be seen from <xref ref-type="disp-formula" rid="eqn-9">Eqs. (9)</xref> and <xref ref-type="disp-formula" rid="eqn-10">(10)</xref>, an increase in precision tends to cause a decrease in recall, and vice versa; hence, the F1 score is computed to maintain a balance between the two measures (<xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>).</p>
<p><disp-formula id="eqn-8">
<label>(8)</label>
<mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:mi mathvariant="italic">A</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-9">
<label>(9)</label>
<mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mi mathvariant="italic">R</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">l</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-10">
<label>(10)</label>
<mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mi mathvariant="italic">P</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">n</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-11">
<label>(11)</label>
<mml:math id="mml-eqn-11" display="block"><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="italic">P</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">n</mml:mi></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mi mathvariant="italic">R</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">l</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="italic">P</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">n</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="italic">R</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">l</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:math>
</disp-formula></p>
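Eqs. (8)–(11) can be computed directly from the four confusion-matrix counts; the following Python sketch is illustrative only (the function name and example counts are assumptions, not the paper's code):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix
    counts, following Eqs. (8)-(11)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example with hypothetical counts:
acc, prec, rec, f1 = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```

A perfect classifier (fp = fn = 0) scores 1.0 on all four measures, which is why precision can reach 100% while accuracy remains slightly below it when a few false negatives remain.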
</sec>
<sec id="s8_2">
<label>8.2</label>
<title>Prediction Performance of Proposed Models</title>
<p>The proposed CNN and ensemble models are trained with 12,900 PDF files: 5,500 benign and 7,400 malicious. The distribution of the training PDF files across the VirusTotal and Contagio datasets is presented in <xref ref-type="table" rid="table-3">Tab. 3</xref>.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Number of samples used in training both models, taken from both datasets, <italic>i.e</italic>., VirusTotal and Contagio</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Benign</th>
<th>Malicious</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>VirusTotal</td>
<td>0</td>
<td>4,200</td>
<td>4,200</td>
</tr>
<tr>
<td>Contagio</td>
<td>5,500</td>
<td>3,200</td>
<td>8,700</td>
</tr>
<tr>
<td>Total</td>
<td>5,500</td>
<td>7,400</td>
<td>12,900</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>After training, the performance of both models is evaluated using a test dataset. The distribution of the test data from both the VirusTotal and Contagio datasets is presented in <xref ref-type="table" rid="table-4">Tab. 4</xref>. The number of samples is decided such that the test data is almost balanced. The total number of files used for the evaluation of the model is 2,963.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Number of samples used to test both CNN and ensemble models taken from both datasets (VirusTotal and Contagio)</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Benign</th>
<th>Malicious</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>VirusTotal</td>
<td>0</td>
<td>1,300</td>
<td>1,300</td>
</tr>
<tr>
<td>Contagio</td>
<td>1,363</td>
<td>300</td>
<td>1,663</td>
</tr>
<tr>
<td>Total</td>
<td>1,363</td>
<td>1,600</td>
<td>2,963</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The confusion matrix of the trained CNN model is shown in <?A3B2 "fig5",5,"anchor"?><xref ref-type="fig" rid="fig-5">Fig. 5</xref>. The accuracy and precision of the CNN model are almost 100%, and the recall and F1 scores are 99.90% and 99.94%, respectively. All measures of the CNN model are presented in <?A3B2 "tbl5",5,"anchor"?><xref ref-type="table" rid="table-5">Tab. 5</xref>.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Confusion matrix of the CNN model on the test data</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_18260-fig-5.png"/>
</fig>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Performance of the CNN model on test data computed based on the confusion matrix in <xref ref-type="fig" rid="fig-5">Fig. 5</xref></title>
</caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Measure</th>
<th>CNN performance (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Accuracy</td>
<td>99.93</td>
</tr>
<tr>
<td>Precision</td>
<td>100</td>
</tr>
<tr>
<td>Recall</td>
<td>99.90</td>
</tr>
<tr>
<td>F1</td>
<td>99.94</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The accuracy of the ensemble model is 97.3%, while the precision and recall are 95.6% and 97.2%, respectively. The performance of the SVM-based ensemble model is lower than that of the CNN-based model because the neural network has greater representational power, owing to its depth and learning capacity: the task is to learn discriminative patterns from the extracted features, and the CNN performs this task more effectively.</p>
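How an ensemble of three kernel SVMs might combine its members' decisions can be sketched with a simple majority vote; the voting scheme and the stand-in decision rules below are illustrative assumptions, not the paper's implementation:

```python
def ensemble_predict(sample, classifiers):
    """Combine the binary votes of several trained classifiers by simple
    majority. Each classifier is a callable returning 0 (benign) or
    1 (malicious)."""
    votes = [clf(sample) for clf in classifiers]
    return 1 if sum(votes) > len(votes) / 2 else 0

# Hypothetical stand-ins for three kernel SVMs (linear, RBF, polynomial):
linear_svm = lambda x: 1 if x[0] > 0.5 else 0
rbf_svm    = lambda x: 1 if x[1] > 0.5 else 0
poly_svm   = lambda x: 1 if x[0] + x[1] > 1.0 else 0

label = ensemble_predict((0.7, 0.2), [linear_svm, rbf_svm, poly_svm])
```

With three members, at least two must agree before a file is flagged, which dampens the errors of any single kernel.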
<p>All measures of the SVM-based ensemble model are presented in <?A3B2 "tbl6",5,"anchor"?><xref ref-type="table" rid="table-6">Tab. 6</xref>.</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Performance of the SVM-based ensemble model on test data</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Measure</th>
<th>Performance of the ensemble model (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Accuracy</td>
<td>97.30</td>
</tr>
<tr>
<td>Precision</td>
<td>95.60</td>
</tr>
<tr>
<td>Recall</td>
<td>97.20</td>
</tr>
<tr>
<td>F1</td>
<td>96.40</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The area under the ROC curve (AUC-ROC) of the CNN model and ensemble model is 99.4% and 94.3%, respectively. <?A3B2 "fig6",5,"anchor"?><xref ref-type="fig" rid="fig-6">Figs. 6</xref> and <?A3B2 "fig7",5,"anchor"?><xref ref-type="fig" rid="fig-7">7</xref> present the ROC curves of the two proposed models.</p>
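AUC-ROC can also be computed without plotting the curve, as the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one; the following sketch uses that rank-statistic definition and is not the paper's evaluation code:

```python
def auc_roc(scores_pos, scores_neg):
    """AUC as P(score of a positive > score of a negative),
    counting ties as 0.5 wins."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Perfectly separated scores give an AUC of 1.0:
auc = auc_roc([0.9, 0.8], [0.2, 0.1])  # → 1.0
```

An AUC of 0.5 corresponds to random guessing, so the reported 99.4% and 94.3% indicate near-complete separation of benign and malicious scores.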
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>AUC-ROC of the proposed CNN model</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_18260-fig-6.png"/>
</fig>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>AUC-ROC of the ensemble model</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_18260-fig-7.png"/>
</fig>
</sec>
<sec id="s8_3">
<label>8.3</label>
<title>Comparison of the Proposed Model with Other Models</title>
<p>In this section, the performance of our models in terms of accuracy, recall, precision, and F1 measure is compared with state-of-the-art models from other studies. Our proposed CNN-based model outperforms every other model presented in <?A3B2 "tbl7",5,"anchor"?><xref ref-type="table" rid="table-7">Tab. 7</xref>. The precision of the He et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] model and our proposed model is 100%, while our CNN-based model scores slightly higher on the other measures. The performance of the SVM-based ensemble model is comparable to that of the model put forward by Falah et al. [<xref ref-type="bibr" rid="ref-6">6</xref>]. However, the performance of the ensemble model is lower than that of models from other recent studies compared in <xref ref-type="table" rid="table-7">Tab. 7</xref>. A graphical representation of the comparison with the different models is presented in <?A3B2 "fig8",5,"anchor"?><xref ref-type="fig" rid="fig-8">Fig. 8</xref>.</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Comparison of accuracy, recall, precision, and F1 measures with models from different studies</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Accuracy (%)</th>
<th>Recall (%)</th>
<th>Precision (%)</th>
<th>F1 (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Falah et al. [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>97.4</td>
<td>96.7</td>
<td>98.6</td>
<td>97.5</td>
</tr>
<tr>
<td>He et al. [<xref ref-type="bibr" rid="ref-5">5</xref>]</td>
<td>99.9</td>
<td>98.49</td>
<td>100</td>
<td>99.24</td>
</tr>
<tr>
<td>Chen et al. [<xref ref-type="bibr" rid="ref-36">36</xref>]</td>
<td>99.74</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>Proposed CNN</td>
<td>99.93</td>
<td>99.9</td>
<td>100</td>
<td>99.94</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Comparison of accuracy, recall, and F1 measures of the proposed CNN model with different state-of-the-art models</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_18260-fig-8.png"/>
</fig>
</sec>
</sec>
<sec id="s9">
<label>9</label>
<title>Robustness Evaluation</title>
<p>To avoid evasion, the robustness of a detection model for malicious PDF documents is indispensable. Therefore, the robustness of the proposed classifier is evaluated using adversarial samples generated <italic>via</italic> Mimicus, a Python library for adversarial classifier evasion [<xref ref-type="bibr" rid="ref-14">14</xref>].</p>
<p>The feature extraction process uses unsupervised learning, whereas the detection process uses supervised learning. Evading the unsupervised model is difficult because of the collision resistance of the selected features, and even if the unsupervised model were evaded by adversarial samples, the supervised model would still exploit the structural information of the input features. It would therefore be difficult to evade the proposed model by explicitly or implicitly placing hidden malicious data in the input file.</p>
<p>Mimicus operates on the basis of the feature set, the training data, and the classification algorithm. We generated approximately 30 samples, a number similar to that used in the study by He et al. [<xref ref-type="bibr" rid="ref-5">5</xref>], and tested our proposed models on these adversarial samples. The results show that all adversarial samples generated under these conditions are detected by the proposed classifiers. In comparison, several state-of-the-art models from the study by Maiorca et al. [<xref ref-type="bibr" rid="ref-37">37</xref>] have shown excellent performance on training samples; however, their performance decreases when tested on adversarial samples generated <italic>via</italic> Mimicus (testing set) [<xref ref-type="bibr" rid="ref-5">5</xref>]. In the face of adversarial attacks, the performance of our model is superior to that of other state-of-the-art models. It is evident from <?A3B2 "fig9",5,"anchor"?><xref ref-type="fig" rid="fig-9">Fig. 9</xref> that our model correctly detected every adversarial sample.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Robustness accuracy of various models computed using adversarial samples generated <italic>via</italic> Mimicus. The accuracy of our proposed model is almost 100%</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_18260-fig-9.png"/>
</fig>
</sec>
<sec id="s10">
<label>10</label>
<title>Limitation</title>
<p>Various methods have been proposed to detect malicious PDF files with comparable accuracy [<xref ref-type="bibr" rid="ref-5">5</xref>&#x2013;<xref ref-type="bibr" rid="ref-37">37</xref>], all of which can be integrated into real-time applications. The detection performance of our proposed method is superior to that of other existing methods; however, it incurs a feature extraction overhead that might reduce the throughput of real-time applications. The two types of feature extraction that make detection possible with 100% precision, namely object content-based features and structure-based features, are time consuming. Nevertheless, despite this time versus precision trade-off, our proposed method can be used in applications where high security against intentionally evasive attacks is desired.</p>
</sec>
<sec id="s11">
<label>11</label>
<title>Conclusion and Future Directions</title>
<p>In this study, we proposed two models for detecting malicious PDF files. The first model is a CNN integrated with a new regularization, and the second is an SVM-based ensemble model with three different kernels. In the first step, a feature extraction method extracts features based on the object content and file structure. In the next step, two different classifiers are trained to detect malicious documents. Both models were trained and validated using two datasets. The first model possesses an advantage over other regularizations in that it optimizes weight values based on their standard deviation: restricting the dispersion of the weight parameters speeds up convergence, prevents overfitting, and improves performance. The model thus leverages feature transformation through three convolutional layers, yielding promising performance. The second model yields comparable results; however, the CNN-based model outperforms all the other models.</p>
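The idea of penalizing the dispersion of the weight parameters can be illustrated with a minimal sketch; the penalty form and the name <monospace>std_regularizer</monospace> are assumptions for illustration, not the paper's exact formulation of the regularization:

```python
import math

def std_regularizer(weights, lam=0.01):
    """Illustrative penalty proportional to the standard deviation of a
    weight vector; adding it to the loss discourages widely dispersed
    weights."""
    n = len(weights)
    mean = sum(weights) / n
    variance = sum((w - mean) ** 2 for w in weights) / n
    return lam * math.sqrt(variance)
```

A layer whose weights are all equal incurs zero penalty, while widely spread weights are penalized in proportion to their spread.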
<p>In the future, the feature extraction methods can be modified to reduce the overhead and delay introduced by the various types of feature extraction. Beyond intensive feature extraction, another avenue worth exploring is the identification of discriminant features for detecting malicious PDF documents with high accuracy, high precision, and greater robustness.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="other">
<p><bold>Funding Statement:</bold> This research work was funded by Makkah Digital Gate Initiative under Grant No. (MDP-IRI-16-2020). Therefore, authors gratefully acknowledge technical and financial support from Emirate Of Makkah Province and King Abdulaziz University, Jeddah, Saudi Arabia.</p>
</fn>
<fn fn-type="conflict">
<p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Mitre Corporation</collab></person-group>, &#x201C;<article-title>CVE details</article-title>,&#x201D; <source>Vulnerability Statistics</source>. <year>2018</year>. [Online]. Available: <uri xlink:href="https://www.cvedetails.com/product/497/Adobe-Acrobat-Reader.html?vendor_id&#x003D;53">https://www.cvedetails.com/product/497/Adobe-Acrobat-Reader.html?vendor_id&#x003D;53</uri>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Beltov</surname></string-name></person-group>, &#x201C;<article-title>PDF phishing scam campaign revealed</article-title>,&#x201D; <source>Best Security Search</source>, <year>2017</year>. [Online]. Available: <uri>https://bestsecuritysearch.com/pdf-phishing-scam-campaign-revealed/</uri>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>Dan</given-names> <surname>Goodin</surname></string-name></person-group>, &#x201C;<article-title>Anti-virus protection gets worse</article-title>,&#x201D; <source>The Register</source>, <year>2007</year>. [Online]. Available: <uri>https://www.theregister.com/2007/12/21/dwindling_antivirus_protection</uri>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>U.</given-names> <surname>Bayer</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Moser</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Kruegel</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Kirda</surname></string-name></person-group>, &#x201C;<article-title>Dynamic analysis of malware code</article-title>,&#x201D; <source>Journal in Computer Virology</source>, vol. <volume>2</volume>, no. <issue>1</issue>, pp. <fpage>67</fpage>&#x2013;<lpage>77</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>He</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Lu</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Detection of malicious pdf files using a two-stage machine learning algorithm</article-title>,&#x201D; <source>Chinese Journal Electronics</source>, vol. <volume>29</volume>, no. <issue>6</issue>, pp. <fpage>1165</fpage>&#x2013;<lpage>1177</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Falah</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Pan</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Huda</surname></string-name>, <string-name><given-names>S. R.</given-names> <surname>Pokhrel</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Anwar</surname></string-name></person-group>, &#x201C;<article-title>Improving malicious pdf classifier with feature engineering: A data-driven approach</article-title>,&#x201D; <source>Future Generation Computer Systems</source>, vol. <volume>115</volume>, no. <issue>2</issue>, pp. <fpage>314</fpage>&#x2013;<lpage>326</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Raff</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zak</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Cox</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Sylvester</surname></string-name>, <string-name><given-names>P. M.</given-names> <surname>Yacci</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>An investigation of byte n-gram features for malware classification</article-title>,&#x201D; <source>Journal of Computer Virology and Hacking Techniques</source>, vol. <volume>14</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>20</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Comodo Antivirus</collab></person-group>, &#x201C;<article-title>How antivirus works</article-title>,&#x201D; <source>Comodo Security Solutions</source>, <year>2020</year>. [Online]. Available: <uri>https://antivirus.comodo.com/faq/how-antivirus-works.html</uri>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Singh</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Tapaswi</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Gupta</surname></string-name></person-group>, &#x201C;<article-title>Malware detection in pdf and office documents: A survey Information Security</article-title>,&#x201D; <source>A Global Perspective</source>, vol. <volume>29</volume>, no. <issue>3</issue>, pp. <fpage>134</fpage>&#x2013;<lpage>153</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Smutz</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Stavrou</surname></string-name></person-group>, &#x201C;<article-title>Malicious pdf detection using metadata and structural features</article-title>,&#x201D; in <conf-name>Proc. of the 28th Annual Computer Security Applications Conf.</conf-name>, <conf-loc>Orlando Florida USA</conf-loc>, pp. <fpage>239</fpage>&#x2013;<lpage>248</lpage>, <year>2012</year>. </mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>&#x0160;rndi&#x0107;</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Laskov</surname></string-name></person-group>, &#x201C;<article-title>Detection of malicious pdf files based on hierarchical document structure</article-title>,&#x201D; in <conf-name>Proc. of the 20th Annual Network &#x0026; Distributed System Security Symp.</conf-name>, <conf-loc>San Diego, California, USA</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>16</lpage>, <year>2013</year>. </mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Parkour</surname></string-name></person-group>, &#x201C;<article-title>Contagio malware dump</article-title>,&#x201D; <source>Version 4 April 2011 - 11,355&#x002B; Malicious Documents-archive for Signature Testing and Research</source>, <year>2011</year>. [Online]. Available: <uri>http://contagiodump.blogspot.com/2010/08/malicious-documents-archive-for.html</uri>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Tong</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Hajaj</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Xiao</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Vorobeychik</surname></string-name></person-group>, &#x201C;<article-title>A framework for validating models of evasion attacks on machine learning with application to pdf malware detection</article-title>,&#x201D; <comment>arXiv Preprint</comment>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Biggio</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Corona</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Maiorca</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Nelson</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Srndc</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Evasion attacks against machine learning at test time</article-title>,&#x201D; in <conf-name>Joint European Conf. on Machine Learning and Knowledge Discovery in Databases</conf-name>, <conf-loc>Berlin, Heidelberg</conf-loc>, pp. <fpage>387</fpage>&#x2013;<lpage>402</lpage>, <year>2013</year>. </mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>&#x0160;rndi&#x0107;</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Laskov</surname></string-name></person-group>, &#x201C;<article-title>Practical evasion of a learning-based classifier: A case study</article-title>,&#x201D; in <conf-name>IEEE Symp. on Security and Privacy</conf-name>, <conf-loc>Berkeley, CA, USA</conf-loc>, pp. <fpage>197</fpage>&#x2013;<lpage>211</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Grosse</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Papernot</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Manoharan</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Backes</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Daniel</surname></string-name></person-group>, &#x201C;<article-title>Adversarial perturbations against deep neural networks for malware classification</article-title>,&#x201D; <comment>arXiv Preprint</comment>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Arp</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Spreitzenbarth</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Hubner</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Gascon</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Rieck</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>DREBIN: Effective and explainable detection of android malware in your pocket</article-title>,&#x201D; in <source>Network and Distributed System Security Symp.</source>, <conf-loc>San Diego, California, USA</conf-loc>, vol. <volume>14</volume>, pp. <fpage>23</fpage>&#x2013;<lpage>26</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Qi</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Evans</surname></string-name></person-group>, &#x201C;<article-title>Automatically evading classifiers: A case study on pdf malware classifiers</article-title>,&#x201D; in <conf-name>Network and Distributed System Security Symp.</conf-name>, <conf-loc>San Diego, California, USA</conf-loc>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Dang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Huang</surname></string-name> and <string-name><given-names>E. C.</given-names> <surname>Chang</surname></string-name></person-group>, &#x201C;<article-title>Evading classifiers by morphing in the dark</article-title>,&#x201D; in <conf-name>Proc. of the 2017 ACM SIGSAC Conf. on Computer and Communications Security</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>119</fpage>&#x2013;<lpage>133</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Rautiainen</surname></string-name></person-group>, &#x201C;<article-title>A look at portable document format vulnerabilities</article-title>,&#x201D; <source>Information Security Technical Report</source>, vol. <volume>14</volume>, no. <issue>1</issue>, pp. <fpage>30</fpage>&#x2013;<lpage>33</lpage>, <year>2009</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J. S.</given-names> <surname>Cross</surname></string-name> and <string-name><given-names>M. A.</given-names> <surname>Munson</surname></string-name></person-group>, &#x201C;<article-title>Deep pdf parsing to extract features for detecting embedded malware</article-title>,&#x201D; <publisher-name>Sandia National Labs</publisher-name>, <comment>SAND2011&#x2013;7982</comment>, <year>2011</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Smutz</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Stavrou</surname></string-name></person-group>, &#x201C;<article-title>Malicious pdf detection using metadata and structural features</article-title>,&#x201D; in <conf-name>Proc. of the 28th Annual Computer Security Applications Conf.</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>239</fpage>&#x2013;<lpage>248</lpage>, <year>2012</year>. </mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Tzermias</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Sykiotakis</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Polychronakis</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Markatos</surname></string-name></person-group>, &#x201C;<article-title>Combining static and dynamic analysis for the detection of malicious documents</article-title>,&#x201D; in <conf-name>Proc. of the Fourth European Workshop on System Security</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>4</fpage>, <year>2011</year>. </mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Mlpdf: An effective machine learning based approach for pdf malware detection</article-title>,&#x201D; <comment>arXiv Preprint</comment>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Maiorca</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Giacinto</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Corona</surname></string-name></person-group>, &#x201C;<article-title>A pattern recognition system for malicious pdf files detection</article-title>,&#x201D; in <conf-name>Int. Workshop on Machine Learning and Data Mining in Pattern Recognition</conf-name>, <conf-loc>Berlin, Heidelberg</conf-loc>, pp. <fpage>510</fpage>&#x2013;<lpage>524</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>&#x0160;rndi&#x0107;</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Laskov</surname></string-name></person-group>, &#x201C;<article-title>Detection of malicious pdf files based on hierarchical document structure</article-title>,&#x201D; in <conf-name>Proc. of the 20th Annual Network &#x0026; Distributed System Security Symp.</conf-name>, <conf-loc>San Diego, CA, USA</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>16</lpage>, <year>2013</year>. </mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Jeong</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Woo</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Kang</surname></string-name></person-group>, &#x201C;<article-title>Malware detection on byte streams of pdf files using convolutional neural networks</article-title>,&#x201D; <source>Security and Communication Networks</source>, vol. <volume>2019</volume>, no. <issue>6</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>9</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>David</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Netanyahu</surname></string-name></person-group>, &#x201C;<article-title>Deepsign: Deep learning for automatic malware signature generation and classification</article-title>,&#x201D; in <conf-name>Int. Joint Conf. on Neural Networks</conf-name>, <conf-loc>Killarney, Ireland</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>8</lpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Pascanu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Stokes</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Sanossian</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Marinescu</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Thomas</surname></string-name></person-group>, &#x201C;<article-title>Malware classification with recurrent networks</article-title>,&#x201D; in <conf-name>IEEE Int. Conf. on Acoustics, Speech and Signal Processing</conf-name>, <conf-loc>South Brisbane, QLD, Australia</conf-loc>, pp. <fpage>1916</fpage>&#x2013;<lpage>1920</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Saxe</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Berlin</surname></string-name></person-group>, &#x201C;<article-title>Deep neural network based malware detection using two dimensional binary program features</article-title>,&#x201D; in <conf-name>10th Int. Conf. on Malicious and Unwanted Software</conf-name>, <conf-loc>Fajardo, PR, USA</conf-loc>, pp. <fpage>11</fpage>&#x2013;<lpage>20</lpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Raff</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Barker</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Sylvester</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Brandon</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Catanzaro</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Malware detection by eating a whole exe</article-title>,&#x201D; in <conf-name>Workshops at the Thirty-Second AAAI Conf. on Artificial Intelligence</conf-name>, <conf-loc>New Orleans, USA</conf-loc>, pp. <fpage>268</fpage>&#x2013;<lpage>276</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Contagio Malware Dump</collab></person-group>, &#x201C;<article-title>External data source</article-title>,&#x201D; [Online]. Available: <uri>http://contagiodump.blogspot.com.au</uri>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A. Y.</given-names> <surname>Ng</surname></string-name></person-group>, &#x201C;<article-title>Feature selection, l1 <italic>vs</italic>. l2 regularization, and rotational invariance</article-title>,&#x201D; in <conf-name>Twenty-First Int. Conf. on Machine Learning</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, <year>2004</year>. </mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Fettaya</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Mansour</surname></string-name></person-group>, &#x201C;<article-title>Detecting malicious pdf using CNN</article-title>,&#x201D; <comment>arXiv Preprint</comment>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Noble</surname></string-name></person-group>, &#x201C;<article-title>What is a support vector machine?</article-title>,&#x201D; <source>Nature Biotechnology</source>, vol. <volume>24</volume>, no. <issue>12</issue>, pp. <fpage>1565</fpage>&#x2013;<lpage>1567</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>She</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Jana</surname></string-name></person-group>, &#x201C;<article-title>On training robust pdf malware classifiers</article-title>,&#x201D; in <conf-name>29th USENIX Security Symp.</conf-name>, <conf-loc>(virtual)</conf-loc>, pp. <fpage>2343</fpage>&#x2013;<lpage>2360</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Maiorca</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Biggio</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Giacinto</surname></string-name></person-group>, &#x201C;<article-title>Towards adversarial malware detection: Lessons learned from pdf-based attacks</article-title>,&#x201D; <source>ACM Computing Surveys</source>, vol. <volume>52</volume>, no. <issue>4</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>36</lpage>, <year>2019</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>
