<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">74505</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2026.074505</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Ransomware Detection Approach Based on LLM Embedding and Ensemble Learning</article-title>
<alt-title alt-title-type="left-running-head">A Ransomware Detection Approach Based on LLM Embedding and Ensemble Learning</alt-title>
<alt-title alt-title-type="right-running-head">A Ransomware Detection Approach Based on LLM Embedding and Ensemble Learning</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Ghourabi</surname><given-names>Abdallah</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>aghourabi@ju.edu.sa</email></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Chouaib</surname><given-names>Hassen</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Department of Computer Science, College of Computer and Information Sciences, Jouf University</institution>, <addr-line>Sakaka</addr-line>, <country>Saudi Arabia</country></aff>
<aff id="aff-2"><label>2</label><institution>College of Science, Jouf University</institution>, <addr-line>Sakaka</addr-line>, <country>Saudi Arabia</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Abdallah Ghourabi. Email: <email>aghourabi@ju.edu.sa</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>10</day><month>2</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>1</issue>
<elocation-id>98</elocation-id>
<history>
<date date-type="received">
<day>13</day>
<month>10</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>01</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_74505.pdf"></self-uri>
<abstract>
<p>In recent years, ransomware attacks have become one of the most common and destructive types of cyberattacks, with a significant impact on the operations, finances, and reputation of affected companies. Despite the efforts of researchers and security experts to protect information systems from these attacks, the threat persists, and the proposed solutions have not been able to significantly curb the spread of ransomware. The recent remarkable achievements of large language models (LLMs) in NLP tasks have prompted cybersecurity researchers to integrate these models into security threat detection. These models offer powerful embedding capabilities that extract rich semantic representations, paving the way for more accurate and adaptive solutions. In this context, we propose a new approach for ransomware detection based on an ensemble method that leverages three distinct LLM embedding models. This ensemble strategy takes advantage of the variety of embedding methods and the strengths of each model. In the proposed solution, each embedding model is associated with an independently trained MLP classifier. The predictions obtained are then merged using a weighted voting technique, assigning each model an influence proportional to its performance. This approach exploits the complementarity of the representations, improves detection accuracy and robustness, and offers a more reliable solution in the face of the growing diversity and complexity of modern ransomware.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Ransomware detection</kwd>
<kwd>ensemble learning</kwd>
<kwd>LLM embedding</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Deanship of Graduate Studies and Scientific Research at Jouf University</funding-source>
<award-id>DGSSR-2024-02-01176</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Ransomware is malicious software that blocks access to personal data by encrypting the data and then asks the owner to send money in exchange for the key to decrypt it [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>]. In recent years, ransomware has topped the list of the most dangerous cyberattacks, a threat that has attracted a lot of attention from both the general public and businesses. Several large commercial enterprises, healthcare facilities, and government administrations have been targeted by ransomware attacks, resulting in service disruptions and significant financial losses. According to a report published by Statista [<xref ref-type="bibr" rid="ref-3">3</xref>], ransomware actors received a total of $1.25 billion in 2023 and $813 million in 2024. These are very significant figures, demonstrating the major impact of this type of attack.</p>
<p>WannaCry is one of the most notable ransomware attacks in the history of cyberattacks. It was launched in May 2017 and spread rapidly worldwide by exploiting a vulnerability in Windows systems [<xref ref-type="bibr" rid="ref-4">4</xref>]. This attack affected more than 230,000 computers in 150 countries, with total losses ranging from hundreds of millions to billions of dollars. Several global organizations were affected by this attack, including the UK&#x2019;s National Health Service (NHS), the Russian Interior Ministry, the Bank of China, the world&#x2019;s largest delivery firm FedEx, among others.</p>
<p>The detection of ransomware presents a major challenge for security specialists because this malicious software evolves rapidly and employs advanced obfuscation and polymorphism techniques, making detection with traditional methods more difficult. In the literature, several works using machine learning and deep learning techniques have been proposed to defend against this threat. These studies generally deal with raw data and features extracted using techniques such as opcode sequence analysis [<xref ref-type="bibr" rid="ref-5">5</xref>], API call capture [<xref ref-type="bibr" rid="ref-6">6</xref>&#x2013;<xref ref-type="bibr" rid="ref-9">9</xref>], PE header information analysis [<xref ref-type="bibr" rid="ref-10">10</xref>], and network traffic monitoring [<xref ref-type="bibr" rid="ref-11">11</xref>]. However, these techniques have not attempted to leverage the strength of Large Language Models (LLMs) and the ability of their embedding models to capture the richness and diversity of textual information describing ransomware behavior. In addition, existing techniques rely on a single model or representation space, which reduces the generalization ability and robustness of the learned models. This observation reveals a research gap in the development of ensemble methods based on multiple textual embeddings, which can combine heterogeneous representations to enhance the accuracy, robustness, and generalization ability of ransomware detection systems.</p>
<p>Large language models (LLMs) have revolutionized the field of natural language processing (NLP) with their ability to generate rich, contextual, and semantic representations. Their use has rapidly expanded into the field of cybersecurity thanks to their potential to improve threat analysis and detection techniques [<xref ref-type="bibr" rid="ref-12">12</xref>]. In recent years, researchers have started to employ LLMs in several areas of cybersecurity, such as intrusion detection [<xref ref-type="bibr" rid="ref-13">13</xref>,<xref ref-type="bibr" rid="ref-14">14</xref>], cyber threat intelligence [<xref ref-type="bibr" rid="ref-15">15</xref>&#x2013;<xref ref-type="bibr" rid="ref-17">17</xref>], malware detection and classification [<xref ref-type="bibr" rid="ref-18">18</xref>,<xref ref-type="bibr" rid="ref-19">19</xref>], and phishing and spam detection [<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-21">21</xref>]. These models enable researchers to design smarter and more adaptable detection systems by giving them robust and generalizable text representations.</p>
<p>In this article, we propose a novel approach for detecting ransomware based on the text embedding models of LLMs. These embedding models are useful for generating rich representations of textual features extracted from executable files. To enhance the detection accuracy of our approach, we employed an ensemble architecture encompassing three different embedding models. The approach involves designing three base learners, each associated with one of the embedding models. Each base learner estimates its prediction for the input instance individually. Then, an ensemble method based on weighted voting is applied to aggregate the outputs of the base learners in an optimized way and make a final decision on the class of the input instance. The weighted voting improves the performance of the ensemble approach by giving more importance to the most reliable models, which allows for better exploitation of their complementarities and reduces the influence of less accurate predictors on the final decision. To the best of our knowledge, the proposed approach is the first to exploit both the embedding capabilities of LLMs and the robustness of ensemble methods for ransomware detection.</p>
<p>The structure of the paper is organized as follows: <xref ref-type="sec" rid="s2">Section 2</xref> discusses related works. In <xref ref-type="sec" rid="s3">Section 3</xref>, we provide a detailed explanation of the proposed model. <xref ref-type="sec" rid="s4">Section 4</xref> presents the results of the experiments conducted. Finally, <xref ref-type="sec" rid="s5">Section 5</xref> concludes the paper.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>To limit the damage caused by ransomware attacks, it is important to improve the ability to detect ransomware as early as possible. Several ransomware detection techniques have been proposed in the literature to address this problem. In this section, we will provide a brief overview of the key research conducted on this topic.</p>
<p>To detect and classify ransomware, Zhang et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] proposed a static analysis of ransomware opcodes using deep learning algorithms. The proposed solution involves extracting opcodes from ransomware files and converting them into N-gram sequences. Then, a self-attention-based convolutional neural network and a bidirectional self-attention network are applied to the obtained opcode sequences to assign the ransomware to its family. The experimental evaluation of the approach was performed on a real dataset with eight ransomware families.</p>
<p>To avoid the complicated process of dynamic analysis, Khammas [<xref ref-type="bibr" rid="ref-22">22</xref>] proposed the use of static analysis to detect ransomware. The idea is to use frequent pattern mining to extract features from raw bytes and then apply the gain ratio technique to select only relevant features. The next step is to use the Random Forest algorithm to classify samples and detect ransomware. The experimental evaluation was performed on PE files belonging to three ransomware families downloaded from VirusTotal. According to the authors, the proposed method achieved a detection accuracy of 97.74%.</p>
<p>Another approach to ransomware detection, proposed by Moreira et al. [<xref ref-type="bibr" rid="ref-10">10</xref>], involves converting portable executable header files into color images in a sequential vector model and categorizing them using an Xception Convolutional Neural Network model without transfer learning. Two datasets were used to evaluate the suggested method: on the first, the model achieved an accuracy of 93.73%; on the second, it achieved 98.20%.</p>
<p>Yamany et al. [<xref ref-type="bibr" rid="ref-23">23</xref>] considered a different method for analyzing ransomware. They proposed a system scheme to track and classify ransomware based on static features to determine the similarity between different ransomware samples. They used the Jaccard Index to calculate the similarity and performed clustering on these samples to categorize them into classified groups.</p>
<p>In other research papers, the authors have opted for a more dynamic approach, which consists of analyzing the behavior of the ransomware during execution and monitoring its activities. For example, in [<xref ref-type="bibr" rid="ref-6">6</xref>], Al-rimy et al. proposed to analyze the evolution of crypto-ransomware behavior across the different attack phases, applying an incremental bagging technique to create incremental subsets. They then applied an enhanced semi-random subspace selection technique to select the most informative features for each subspace and integrated both techniques into an ensemble-based model for ransomware detection. According to the authors, experimental evaluation has shown that their approach can overcome the problem of delayed ransomware detection and is able to detect crypto-ransomware in the early stages of an attack.</p>
<p>Berrueta et al. [<xref ref-type="bibr" rid="ref-11">11</xref>] have proposed an approach for detecting crypto-ransomware in file-sharing environments. Their idea is to monitor the traffic exchanged between clients and file servers and capture activities related to opening, closing and modifying files. The detection of ransomware is based on analyzing these features using machine learning algorithms to distinguish between the activities of ransomware and those of benign applications. According to the authors, the proposed approach can work with both plaintext and encrypted file-sharing protocols.</p>
<p>Urooj et al. [<xref ref-type="bibr" rid="ref-7">7</xref>] proposed a method for weighting Generative Adversarial Networks (GANs) to detect the behavior of ransomware attacks. They first used TF-IDF to capture API calls from a dynamic analysis during the pre-encryption phase of the attack. Next, they employed GANs to increase the amount of data collected by generating real-like synthetic data. Then, they integrated the mutual information method into the GAN structure to estimate the importance of features, helping the detection model to handle ransomware behavioral drift.</p>
<p>Similarly, Cen et al. [<xref ref-type="bibr" rid="ref-8">8</xref>] proposed analyzing ransomware API calls before the encryption attack begins. They used NLP techniques to represent the features extracted from the API sequences and then employed a Recurrent Neural Network (RNN) classifier to identify the ransomware.</p>
<p>Regarding the use of LLM models in the field of cybersecurity, several studies have appeared in recent years proposing LLM-based approaches for malware analysis and detection. For example, Hossain et al. [<xref ref-type="bibr" rid="ref-24">24</xref>] employed the Mixtral LLM model to learn the patterns and characteristics that facilitate the identification of malicious code in Java programs. In a recent article, Zhou et al. [<xref ref-type="bibr" rid="ref-25">25</xref>] introduced a framework for ransomware detection and classification based on semantic analysis utilizing LLM-assisted pre-training. Their idea involves enriching the dynamic traces of the program by rewriting them in a more understandable linguistic form, then applying LLM-assisted pre-training based on the GPT-2 model, which is subsequently fine-tuned for the tasks of detecting and classifying ransomware into families. In another article [<xref ref-type="bibr" rid="ref-26">26</xref>], Feng et al. proposed a method for detecting Android malware based on LLMs. In their approach, they extracted APK features (permissions, API calls, strings) and combined them into a textual representation, then adapted the LLM through structured prompt engineering and dedicated fine-tuning to identify malicious behavior and detect malware.</p>
<p><xref ref-type="table" rid="table-1">Table 1</xref> provides a comparative overview of the works discussed in this section. Previous works on ransomware detection have explored different approaches, ranging from static analysis of opcodes and PE files [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-22">22</xref>], to methods based on visual representations or similarity measures [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-23">23</xref>], as well as dynamic approaches analyzing the behavior of malware during its execution [<xref ref-type="bibr" rid="ref-6">6</xref>&#x2013;<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>]. Although these methods have shown good performance, they are still limited in their ability to capture deeper semantic intent or to exploit the rich contextual information embedded in execution traces. In this article, we propose a more robust strategy that employs multiple LLM-based embedding models within an ensemble architecture. By utilizing diverse and semantically rich representations generated by state-of-the-art LLMs, our approach can effectively capture behavioral variations and high-level contextual cues that traditional feature engineering methods or single-model techniques may overlook. This allows our system to handle the diversity and complexity of malicious behaviors more effectively than conventional techniques.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>A comparative summary of the discussed related works.</title>
</caption>
<table>
<colgroup>
<col align="center" width="20mm"/>
<col align="center" width="20mm"/>
<col align="center" width="20mm"/>
<col align="center" width="20mm"/>
<col align="center" width="28mm"/>
<col align="center" width="30mm"/> </colgroup>
<thead>
<tr>
<th>Paper reference</th>
<th>Year</th>
<th>Objective</th>
<th>Feature type</th>
<th>Methods used</th>
<th>Dataset type</th>
</tr>
</thead>
<tbody>
<tr>
<td>Zhang et al. [<xref ref-type="bibr" rid="ref-5">5</xref>]</td>
<td>2020</td>
<td>Detect and classify ransomware</td>
<td>Opcode</td>
<td>N-gram, TF-IDF, self-attention, CNN</td>
<td>Ransomware samples obtained from VirusTotal</td>
</tr>
<tr>
<td>Khammas [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td>2020</td>
<td>Detect and classify ransomware</td>
<td>Byte-level</td>
<td>Frequent pattern mining, gain ratio, random forest</td>
<td>Ransomware samples obtained from VirusTotal</td>
</tr>
<tr>
<td>Moreira et al. [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>2023</td>
<td>Detect and classify ransomware</td>
<td>PE header</td>
<td>Xception Convolutional Neural Network</td>
<td>Collected ransomware samples</td>
</tr>
<tr>
<td>Yamany et al. [<xref ref-type="bibr" rid="ref-23">23</xref>]</td>
<td>2022</td>
<td>Ransomware classification and clustering</td>
<td>Import Address Table</td>
<td>Jaccard Index, k-means</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Al-rimy et al. [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>2019</td>
<td>Ransomware detection</td>
<td>API calls</td>
<td>Incremental bagging, semi-random subspace selection, ensemble learning</td>
<td>Ransomware samples obtained from virusshare.com</td>
</tr>
<tr>
<td>Berrueta et al. [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>2022</td>
<td>Ransomware detection</td>
<td>Network traffic</td>
<td>Decision trees, tree ensembles, and neural networks</td>
<td>Traffic traces from more than 70 ransomware samples [<xref ref-type="bibr" rid="ref-27">27</xref>]</td>
</tr>
<tr>
<td>Urooj et al. [<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td>2023</td>
<td>Detection of ransomware behavioral drift</td>
<td>API calls</td>
<td>TF-IDF, Generative Adversarial Networks, mutual information</td>
<td>Ransomware dynamic pre-encryption dataset [<xref ref-type="bibr" rid="ref-28">28</xref>]</td>
</tr>
<tr>
<td>Cen et al. [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>2025</td>
<td>Early ransomware detection before encryption</td>
<td>API calls</td>
<td>Occurrence of Words (OoW), Bag of Words (BoW), Sequence of Words (SoW), Recurrent Neural Network (RNN)</td>
<td>Automated dynamic analysis of ransomware [<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
</tr>
<tr>
<td>Zhou et al. [<xref ref-type="bibr" rid="ref-25">25</xref>]</td>
<td>2025</td>
<td>Ransomware detection using LLM-assisted pre-training</td>
<td>API calls, Windows registry</td>
<td>LLM-assisted task-adaptive pre-training</td>
<td>Automated dynamic analysis of ransomware [<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s3">
<label>3</label>
<title>Proposed Model</title>
<p>In this section, we present the approach we propose in the current article. This methodology is based on the integration of embedding functions from large language models (LLMs) and ensemble learning techniques. The design and operating principle of the proposed model are illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref> and Algorithm 1. First, textual information is extracted from the PE file, such as header information, imported/exported functions, and section information. From this extracted data, three distinct representations are then generated, each produced by a different embedding model. This diversity allows us to capture complementary contextual and structural properties. Next, these vectors are used to train three neural network classifiers independently. The predictions obtained from the three classifiers are then combined through an ensemble voting method to make a final decision on whether the input sample is ransomware. This methodology simultaneously exploits the complementarity between embedding models and the high performance of the ensemble method.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Architecture of the Ensemble detection model.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74505-fig-1.tif"/>
</fig>
<fig id="fig-4">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74505-fig-4.tif"/>
</fig>
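<p>The final weighted-voting step of the architecture can be sketched as follows. This is a minimal illustration, not the paper's implementation; the probabilities and weights below are made-up placeholder values, with each weight standing in for a base learner's measured performance.</p>

```python
# Sketch of weighted soft voting: each base learner outputs a probability
# that the sample is ransomware, and its vote is weighted in proportion to
# its performance (illustrative values, not the paper's actual scores).
def weighted_vote(probs, weights, threshold=0.5):
    """Combine base-learner probabilities into one binary decision."""
    total = sum(weights)
    score = sum(w / total * p for w, p in zip(weights, probs))
    return ("ransomware" if score >= threshold else "benign", score)

# Hypothetical example with three base learners (the OpenAI, VoyageAI,
# and SBERT branches), weighted by assumed validation accuracies.
label, score = weighted_vote(probs=[0.9, 0.7, 0.4], weights=[0.98, 0.96, 0.93])
```

<p>Because the weights are normalized, a weaker learner that disagrees with the two stronger ones is outvoted, which is the intended effect of performance-proportional weighting.</p>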
<sec id="s3_1">
<label>3.1</label>
<title>Text Feature Extraction</title>
<p>Textual data extraction from PE files is a method that aims to identify and retrieve meaningful information that is embedded directly in the file structure, without the need to execute it. This data can provide crucial clues about a program&#x2019;s functionality, dependencies, and potential behavior [<xref ref-type="bibr" rid="ref-30">30</xref>]. Several types of information can be extracted from PE files. In our approach, we focused on the following data: header information, section information, imported functions, and exported functions.
<list list-type="bullet">
<list-item>
<p>Header information: The PE header is the first source of information to analyze. It contains a set of data structures that define the properties of the file. Extracting text from this area can reveal fundamental information, such as the target architecture of the program (e.g., x86 or x64), the list of image characteristics, DLL characteristics, major and minor image versions, linker versions, and system versions. In <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, we show an example of the textual information we extracted from the header section of a PE file.</p>
</list-item>
<list-item>
<p>Section information: More detailed information can be found in the section headers [<xref ref-type="bibr" rid="ref-31">31</xref>]. Each section of the PE file, such as .text (executable code), .data (initialized data), or .rdata (read-only data), is described by a header that contains its name, size, and location.</p></list-item>
<list-item>
<p>Imported functions: Imported functions are references to functions located in other modules (usually DLLs) that the program needs to run. The import directory (IMAGE_DIRECTORY_ENTRY_IMPORT) of the PE file lists the required DLLs and the names of the functions to be imported. Analyzing these imports provides an accurate picture of a program&#x2019;s potential functionality, which is essential for assessing its malicious behavior [<xref ref-type="bibr" rid="ref-32">32</xref>].</p></list-item>
<list-item>
<p>Exported functions: These functions are made available to other programs by the PE file. The export directory (IMAGE_DIRECTORY_ENTRY_EXPORT) contains the names and addresses of these functions. Extracting these names is crucial to understanding the role of a DLL and how other executables can interact with it.</p></list-item>
</list></p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>An example of header information extracted from a PE file.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74505-fig-2.tif"/>
</fig>
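<p>Before embedding, the four feature groups above must be serialized into a single text string. The paper does not specify the exact serialization format, so the following sketch is purely illustrative; the field names and layout are assumptions. In practice, a library such as pefile can populate these fields from a real PE file without executing it.</p>

```python
# Hypothetical serialization of extracted PE features into one text string
# suitable for an embedding model. Field names and layout are assumptions,
# not the paper's actual format.
def build_feature_text(header, sections, imports, exports):
    parts = ["HEADER: " + "; ".join(f"{k}={v}" for k, v in header.items())]
    parts.append("SECTIONS: " + "; ".join(
        f"{s['name']} size={s['size']}" for s in sections))
    parts.append("IMPORTS: " + "; ".join(
        f"{dll}:{','.join(funcs)}" for dll, funcs in imports.items()))
    parts.append("EXPORTS: " + ("; ".join(exports) if exports else "none"))
    return "\n".join(parts)

# Example with made-up values for one benign-looking PE file.
text = build_feature_text(
    header={"machine": "x64", "linker_version": "14.0"},
    sections=[{"name": ".text", "size": 4096}, {"name": ".data", "size": 512}],
    imports={"KERNEL32.dll": ["CreateFileW", "WriteFile"]},
    exports=[],
)
```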
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Text Embedding</title>
<p>Text embedding is the process of converting text into numerical representations that can serve as input to machine learning models. Large language models (LLMs) have recently revolutionized this field by producing high-quality embeddings. The embedding operation performed by LLMs involves projecting a sequence of text <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>x</mml:mi></mml:math></inline-formula> into a high-dimensional space <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msup></mml:math></inline-formula>, while capturing semantic and contextual properties [<xref ref-type="bibr" rid="ref-33">33</xref>]. These representations enable the measurement of similarity between texts, facilitate knowledge transfer, and serve as input for supervised or unsupervised learning models.</p>
<p>In our approach, we aimed to take advantage of these revolutionary models to represent the textual content of PE files. Our idea is to generate three vector representations from three different text embedding models (OpenAI, VoyageAI, and Sentence Transformers). Using multiple text embedding models simultaneously allows us to benefit from the complementarity of their textual representations and leads to richer and more robust representations. This idea improves the performance of classification tasks and promotes better generalization, particularly in complex contexts such as malware detection.</p>
<p>The embedding operation in our approach can be represented by the following formula:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></disp-formula>where <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>x</mml:mi></mml:math></inline-formula> is the input text, <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>E</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is the <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>i</mml:mi></mml:math></inline-formula>th embedding model used to convert the text, <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is the resulting numerical representation, and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>d</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is the dimension of the corresponding vector.</p>
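<p>Eq. (1) can be illustrated with the following sketch, in which three toy deterministic embedders stand in for the real models. The dimensions match those reported below (1536 for text-embedding-3-small, 1024 for voyage-3.5 and SBERT), but the hash-derived values carry no semantics and exist only to show the interface.</p>

```python
import hashlib

# Toy stand-ins for the three embedding models E_i of Eq. (1): each maps an
# input text x to a vector v_i of dimension d_i. The values are hash-derived
# placeholders, not semantic embeddings.
def toy_embedder(dim):
    def embed(text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Repeat the 32 digest bytes to fill dim slots, scaled into [0, 1).
        return [digest[i % 32] / 256.0 for i in range(dim)]
    return embed

embedders = {                      # E_1, E_2, E_3 with their real dimensions
    "openai": toy_embedder(1536),
    "voyage": toy_embedder(1024),
    "sbert": toy_embedder(1024),
}
vectors = {name: E("HEADER: machine=x64") for name, E in embedders.items()}
```

<p>In the actual system, each call would be replaced by the corresponding provider API or the Sentence-Transformers encode method.</p>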
<sec id="s3_2_1">
<label>3.2.1</label>
<title>OpenAI Text Embedding</title>
<p>After the success of its famous GPT model, OpenAI developed other models dedicated to text embedding. These models use large-scale contrastive pre-training to build vector representations that can show the semantic and contextual links between texts [<xref ref-type="bibr" rid="ref-34">34</xref>,<xref ref-type="bibr" rid="ref-35">35</xref>]. The architecture of the embedding model is based on Transform Encoder <italic>E</italic>, which projects input text into a dense representation <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>v</mml:mi><mml:mo>=</mml:mo><mml:mi>E</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msup></mml:math></inline-formula> of fixed size that can be directly used for semantic search, classification, or clustering tasks. In our case, we used the &#x201C;text-embedding-3-small&#x201D; model [<xref ref-type="bibr" rid="ref-35">35</xref>], which provides a representation vector of size 1536.</p>
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>VoyageAI Text Embedding</title>
<p>VoyageAI recently introduced a family of text embedding models [<xref ref-type="bibr" rid="ref-36">36</xref>] optimized for semantic search, retrieval-augmented generation (RAG), and specialized domains (law, finance, code, etc.). VoyageAI models are renowned for their high performance on benchmark tests such as MTEB (Massive Text Embedding Benchmark). They are based on Transformer-type architectures capable of generating dense vectors of configurable dimensions (256 to 2048) and processing extended contexts of up to 32,000 tokens, which is an advantage for processing long documents. The model we use in our approach is called &#x201C;voyage-3.5&#x201D; [<xref ref-type="bibr" rid="ref-36">36</xref>] and generates representation vectors of size 1024.</p>
</sec>
<sec id="s3_2_3">
<label>3.2.3</label>
<title>Sentence-Transformers Embedding</title>
<p>The Sentence-Transformers model, also known as Sentence-BERT (SBERT) [<xref ref-type="bibr" rid="ref-37">37</xref>], is a modified version of BERT (Bidirectional Encoder Representations from Transformers) designed to produce high-quality sentence embeddings for tasks such as semantic similarity, information retrieval, and clustering. SBERT uses Siamese and triplet network architectures to generate semantically meaningful sentence embeddings that facilitate comparison via measures such as cosine similarity. Unlike standard BERT, SBERT applies a pooling strategy (such as mean pooling) to the output token embeddings to produce a single vector representation for each sentence. The output vector we obtain from this embedding model has a size of 1024.</p>
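The mean-pooling step mentioned above can be sketched as follows (a toy illustration over the encoder's token outputs, not the actual SBERT weights):

```python
import numpy as np

def mean_pooling(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average a sentence's token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, dim) output of the Transformer encoder.
    attention_mask:   (seq_len,) with 1 for real tokens and 0 for padding.
    Returns a single (dim,) sentence vector.
    """
    mask = attention_mask[:, None].astype(float)         # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)       # sum over real tokens only
    count = np.clip(mask.sum(), a_min=1e-9, a_max=None)  # avoid division by zero
    return summed / count
```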
</sec>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>MLP Classifier</title>
<p>Multilayer Perceptrons (MLPs) are artificial neural networks composed of several layers of neurons: an input layer, one or more hidden layers, and an output layer [<xref ref-type="bibr" rid="ref-38">38</xref>]. MLPs are fully connected: each node is linked to every node in the adjacent layers. The outputs of each layer serve as inputs to the next layer, and so on until the output layer is reached.</p>
<p>In our case, we use an MLP for binary classification using a single hidden layer. The output of this hidden layer, <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>h</mml:mi></mml:math></inline-formula>, is a vector that can be calculated as follows:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>h</mml:mi><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>x</mml:mi></mml:math></inline-formula> is an input vector, <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>W</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> is the weight matrix applied between the input layer and the hidden layer, <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>b</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> is the bias of the hidden layer, and <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>g</mml:mi></mml:math></inline-formula> is a non-linear activation function (e.g., ReLU).</p>
<p>The output <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>h</mml:mi></mml:math></inline-formula> becomes the input to the output layer, and the final output value will be calculated as follows:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mi>h</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> is the weight matrix applied between the hidden layer and the output layer, <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>b</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> is the bias of the output layer, <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is the Sigmoid activation function, defined as:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>z</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
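Eqs. (2)–(4) translate directly into code; the sketch below uses NumPy and leaves the weights as parameters (the function names are ours, for illustration only):

```python
import numpy as np

def relu(z):
    """Non-linear activation g used in the hidden layer."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Sigmoid activation of Eq. (4)."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Single-hidden-layer MLP for binary classification.

    h     = g(W1 x + b1)      (Eq. 2, g = ReLU)
    y_hat = sigma(W2 h + b2)  (Eq. 3)
    Returns the predicted probability that the sample is ransomware.
    """
    h = relu(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)
```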
<p>This MLP classifier is applied to each of the three vector representations produced by the embedding models described above. The objective is to obtain three separate predictions, each estimating the probability that the input sample is ransomware. This diversity reduces reliance on a single vector space and limits the biases associated with any particular model.</p>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Weighted Voting Ensemble</title>
<p>The previous step provided us with three different classification models, each generating predictions that may be similar or different from the other models. To make a final decision about the prediction of the input sample, we took advantage of ensemble learning techniques in our approach. Ensemble learning is a method in machine learning based on training multiple models to predict a solution for a given problem [<xref ref-type="bibr" rid="ref-39">39</xref>]. An ensemble model consists of several learners, commonly known as base learners. This approach is notable for its ability to combine multiple weak learners, resulting in a strong learner that demonstrates greater accuracy than any of the individual base learners.</p>
<p>To combine the results of the base models in an ensemble architecture and provide a single output prediction, several methods have been proposed in the literature, including stacking, bagging, boosting, and voting. In our approach, we chose the weighted voting method to aggregate the predictions of the three models. The idea behind this method is that base learners do not perform equally well, so it is more appropriate to give more influence to the best-performing models. Weighted voting consists of assigning different weights to the base models according to specific criteria and voting on the results while taking these weights into account. Generally, the best-performing model receives the highest weight [<xref ref-type="bibr" rid="ref-21">21</xref>,<xref ref-type="bibr" rid="ref-40">40</xref>].</p>
<p>To explain the weighted voting process, let us consider <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>p</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> as the predictions obtained from the three MLP classifiers <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>h</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>h</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>h</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula>, respectively, and <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>W</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>W</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> as the weights assigned to the three MLP models. We begin by calculating the aggregate value of the three predictions, taking into account the associated weights, using the following formula:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Then, we determine the output value <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> by voting on the final decision of the instance, which can be 0 or 1, using the following formula:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2265;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> is a decision threshold generally equal to 0.5.</p>
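Eqs. (5) and (6) amount to a weighted average followed by thresholding, which can be sketched as:

```python
import numpy as np

def weighted_vote(predictions, weights, threshold=0.5):
    """Aggregate base-model probabilities by weighted voting.

    predictions: probabilities p1..p3 from the three MLP classifiers.
    weights:     weights W1..W3 assigned to the base models.
    threshold:   decision threshold tau.
    Returns the final class label (1 = ransomware, 0 = benign).
    """
    p = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)
    p_agg = np.sum(w * p) / np.sum(w)   # Eq. (5)
    return int(p_agg >= threshold)      # Eq. (6)
```

For example, with predictions (0.9, 0.2, 0.8) and weights (2, 1, 1), the aggregate probability is 0.7, so the sample is classified as ransomware.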
<p>The challenge we face with this voting method is how to find the optimal weight combination. This choice is crucial to ensuring the best performance of the ensemble model. In our approach, we opted for the Bayesian optimization method to select the right weights. Bayesian optimization is an effective method for optimizing objective functions that are expensive to evaluate [<xref ref-type="bibr" rid="ref-41">41</xref>]. The fundamental goal of Bayesian optimization is to identify the global minimum (or maximum) of a function <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> by developing a probabilistic model, known as a surrogate model, that approximates <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and is refined after each evaluation of <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-42">42</xref>]. Adopting this method in our approach leads us to solve the following optimization problem:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>&#x03C7;</mml:mi></mml:mrow></mml:munder><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:math></inline-formula> represents the weights of the ensemble model that we aim to optimize. <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mi>&#x03C7;</mml:mi></mml:math></inline-formula> defines the search space for the weights. The objective function <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> refers to the performance of the ensemble model according to the chosen weights. This performance is evaluated by calculating the accuracy measure. The solution to this optimization problem consists of finding the optimal combination of weights <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:math></inline-formula> that helps the function <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> achieve the best performance.</p>
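Libraries such as scikit-optimize implement the surrogate-model loop itself; as a dependency-free illustration of the search in Eq. (7), the sketch below uses random sampling of the weight space as a stand-in for Bayesian optimization proper (the function names and trial budget are our assumptions):

```python
import numpy as np

def optimize_weights(objective, n_trials=200, seed=0):
    """Search for the weight combination x* that maximizes f(x), as in Eq. (7).

    objective: maps a weight vector (w1, w2, w3) to ensemble accuracy.
    A real implementation would refine a surrogate model (e.g., a Gaussian
    process) after each evaluation; random search stands in for it here.
    """
    rng = np.random.default_rng(seed)
    best_w, best_f = None, -np.inf
    for _ in range(n_trials):
        w = rng.uniform(0.0, 1.0, size=3)  # candidate weights in the search space
        f = objective(w)
        if f > best_f:
            best_w, best_f = w, f
    return best_w, best_f
```

In our pipeline, `objective` would evaluate the weighted-voting ensemble's accuracy on a validation split for the candidate weights.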
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experimental Evaluation</title>
<p>Experimental benchmarking is a crucial step in evaluating an approach. To this end, we have prepared several experiments based on the model proposed in this article. For the comparative tests, we also tested other baseline algorithms using vectorized features instead of text embedding. In this section, we present the results of these experiments and analyze our findings.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Dataset Description</title>
<p>To train the proposed model, we used the Ember dataset to gather a collection of ransomware samples. EMBER (Endgame Malware Benchmark for Research) [<xref ref-type="bibr" rid="ref-43">43</xref>] is a labeled dataset specifically designed to benchmark machine learning models dedicated to the detection of malicious Windows portable executable files. The dataset comprises information obtained from 1.1 million binary files. It contains several raw characteristics extracted from PE files, including general file information, header information, imported functions, exported functions, section information, and format-agnostic information. This dataset, which originally contains several types of malware, allowed us to derive a more targeted dataset containing only ransomware. The final dataset used in our experiments comprises 13,623 ransomware samples and 50,000 benign samples.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Results</title>
<sec id="s4_2_1">
<label>4.2.1</label>
<title>Comparison with Base Embedding Models</title>
<p>To evaluate the performance of the proposed approach, we first assessed the three basic models individually: (i) an OpenAI embedding combined with an MLP classifier, (ii) a VoyageAI embedding with an MLP classifier, and (iii) an embedding with Sentence-Transformers followed by an MLP classifier. Subsequently, we tested the Ensemble model, which applies weighted voting to the outputs of the three individual models. The objective of these experiments is to determine whether the weighted voting method can outperform the basic models.</p>
<p>To achieve a more robust comparative evaluation, we used 5-fold cross-validation to split the dataset. We then calculated four evaluation metrics for each model: accuracy, precision, recall, and F1-score, and reported the average value of the 5-fold tests for each metric. Additionally, to provide insight into the computation duration, we included the training time for each model. Regarding the parameters of the MLP classifier, we used a neural network with a single hidden layer and an input layer whose size varied depending on the type of embedding: 1536 in the case of OpenAI and 1024 for both VoyageAI and Sentence-Transformers. The results of these experiments are presented in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
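The four reported metrics follow directly from the confusion counts; a minimal sketch of their computation:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score for binary labels (1 = ransomware)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

Each metric is computed per fold and then averaged over the 5 folds.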
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Comparison of results between ensemble method and base embedding models.</title>
</caption>
<table>
<colgroup>
<col align="center" width="41mm"/>
<col align="center" width="17mm"/>
<col align="center" width="17mm"/>
<col align="center" width="17mm"/>
<col align="center" width="17mm"/>
<col align="center" width="28mm"/> </colgroup>
<thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1-score</th>
<th>Training Time (seconds)</th>
</tr>
</thead>
<tbody>
<tr>
<td>OpenAI embedding &#x002B; MLP</td>
<td>0.9788756</td>
<td>0.95328</td>
<td>0.9479512</td>
<td>0.9505264</td>
<td>209.04</td>
</tr>
<tr>
<td>VoyageAI embedding &#x002B; MLP</td>
<td>0.9807934</td>
<td>0.957177</td>
<td>0.9529868</td>
<td>0.9550372</td>
<td>117.86</td>
</tr>
<tr>
<td>Sentence-Transformers embedding &#x002B; MLP</td>
<td>0.9781524</td>
<td>0.952077</td>
<td>0.9456182</td>
<td>0.948786</td>
<td>198.78</td>
</tr>
<tr>
<td>Weighted Voting Ensemble</td>
<td><bold>0.98675</bold></td>
<td><bold>0.977733</bold></td>
<td><bold>0.9599914</bold></td>
<td><bold>0.968767</bold></td>
<td>533.89</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>By analyzing the results of the three base models, we conclude that the VoyageAI embedding demonstrates superior performance in ransomware detection, achieving an average accuracy of 0.9808, a precision of 0.9572, a recall of 0.9530, and an F1-score of 0.9550, while also requiring the shortest training time. The other embedding models also performed well, with OpenAI reaching an accuracy of 0.9789 and Sentence-Transformers an accuracy of 0.9782.</p>
<p>When examining the results of the Weighted Voting Ensemble model, we note a fairly significant improvement in all evaluation metrics compared to the base models. The ensemble model achieved an accuracy of 0.9868, a precision of 0.9777, a recall of 0.9600, and an F1 score of 0.9688. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> illustrates the percentage improvements of our ensemble model compared to the base models, indicating a positive enhancement across the board, with improvements ranging from 0.6% to 2.6%. For instance, the ensemble method showed an improvement over the OpenAI model of 0.8% in accuracy, 2.4% in precision, 1.2% in recall, and 1.8% in F1 score.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Improvement of the weighted ensemble method.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74505-fig-3.tif"/>
</fig>
</sec>
<sec id="s4_2_2">
<label>4.2.2</label>
<title>Comparison with Algorithms Using Vectorized Features</title>
<p>Vectorized features are a numerical representation of the raw features of the PE file based on feature hashing methods. This type of feature was designed by the creators of the Ember dataset to assist researchers in training their models. To compare this technique with the approach we propose, we tested it in combination with baseline classification algorithms, including KNN, Decision Tree, Logistic Regression, Random Forest, XGBoost, AdaBoost, and MLP. The results of these experiments are presented in <xref ref-type="table" rid="table-3">Table 3</xref>.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Comparison of results between our ensemble method and baseline classification algorithms.</title>
</caption>
<table>
<colgroup>
<col align="center" width="30mm"/>
<col align="center" width="20mm"/>
<col align="center" width="16mm"/>
<col align="center" width="16mm"/>
<col align="center" width="16mm"/>
<col align="center" width="16mm"/>
<col align="center" width="25mm"/> </colgroup>
<thead>
<tr>
<th>Features Type</th>
<th>Model</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1-score</th>
<th>Training Time (seconds)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="7">Vectorized features</td>
<td>KNN</td>
<td>0.954592</td>
<td>0.917379</td>
<td>0.89095</td>
<td>0.866072</td>
<td>0.21</td>
</tr>
<tr>
<td>Decision tree</td>
<td>0.967811</td>
<td>0.92084</td>
<td>0.925206</td>
<td>0.929641</td>
<td>230.79</td>
</tr>
<tr>
<td>Logistic regression</td>
<td>0.966977</td>
<td>0.929855</td>
<td>0.922205</td>
<td>0.914719</td>
<td>72.83</td>
</tr>
<tr>
<td>Random Forest</td>
<td>0.979379</td>
<td>0.993908</td>
<td>0.949692</td>
<td>0.909313</td>
<td>106.19</td>
</tr>
<tr>
<td>XGBoost</td>
<td>0.983015</td>
<td>0.977506</td>
<td>0.959365</td>
<td>0.941908</td>
<td>15.93</td>
</tr>
<tr>
<td>AdaBoost</td>
<td>0.940776</td>
<td>0.89487</td>
<td>0.855694</td>
<td>0.820261</td>
<td>257.33</td>
</tr>
<tr>
<td>MLP</td>
<td>0.980636</td>
<td>0.954521</td>
<td>0.954755</td>
<td>0.95502</td>
<td>225.69</td>
</tr>
<tr>
<td>Combined Embeddings (OpenAI &#x002B; VoyageAI &#x002B; Sentence Transformers)</td>
<td>KNN</td>
<td>0.974915</td>
<td>0.943942</td>
<td>0.941246</td>
<td>0.938582</td>
<td>0.41</td>
</tr>
<tr>
<td/>
<td>Decision tree</td>
<td>0.956022</td>
<td>0.892163</td>
<td>0.897982</td>
<td>0.903908</td>
<td>1522.82</td>
</tr>
<tr>
<td/>
<td>Logistic Regression</td>
<td>0.962655</td>
<td>0.942265</td>
<td>0.909773</td>
<td>0.879531</td>
<td>22.05</td>
</tr>
<tr>
<td/>
<td>Random Forest</td>
<td>0.977492</td>
<td>0.985496</td>
<td>0.945302</td>
<td>0.908312</td>
<td>578.74</td>
</tr>
<tr>
<td/>
<td>XGBoost</td>
<td>0.982176</td>
<td>0.976628</td>
<td>0.957557</td>
<td>0.939276</td>
<td>133.77</td>
</tr>
<tr>
<td/>
<td>AdaBoost</td>
<td>0.936438</td>
<td>0.868788</td>
<td>0.848019</td>
<td>0.828367</td>
<td>1532.22</td>
</tr>
<tr>
<td/>
<td>MLP</td>
<td>0.977414</td>
<td>0.957133</td>
<td>0.946776</td>
<td>0.936887</td>
<td>425.64</td>
</tr>
<tr>
<td>Ensemble Embedding (Our approach)</td>
<td>Weighted voting ensemble (Our approach)</td>
<td><bold>0.98675</bold></td>
<td>0.977733</td>
<td><bold>0.9599914</bold></td>
<td><bold>0.968767</bold></td>
<td>533.89</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Analysis of the results indicates that classification accuracy varies depending on the algorithm used. Overall, tree-based algorithms such as XGBoost and Random Forest performed better than the others. XGBoost performed best, with an accuracy of 0.9830, a precision of 0.9775, a recall of 0.9594, and an F1-score of 0.9419. The MLP classifier also showed comparable results, achieving an accuracy of 0.9806. However, all of these results remained inferior to those obtained by our Weighted Voting Ensemble model.</p>
</sec>
<sec id="s4_2_3">
<label>4.2.3</label>
<title>Comparison with Algorithms Using Combined Embeddings</title>
<p>Concatenating the three generated embeddings provides a simple yet important baseline: merging all feature vectors into a single unified and rich representation allows a classifier to learn from the full information space without relying on voting or fusion strategies. To assess the benefits of our weighted voting ensemble method compared to this baseline, we trained seven algorithms (KNN, Decision Tree, Logistic Regression, Random Forest, XGBoost, AdaBoost, and MLP) on the concatenated feature vector and report their classification results in <xref ref-type="table" rid="table-3">Table 3</xref>.</p>
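With the vector sizes reported in Section 3.2 (1536 for OpenAI, 1024 each for VoyageAI and Sentence-Transformers), the concatenation yields a 3584-dimensional representation per sample; a sketch with zero placeholders standing in for real embeddings:

```python
import numpy as np

# Placeholder embeddings of one sample from the three models
# (zeros stand in for real embedding values; only the shapes matter here).
openai_vec = np.zeros(1536)   # text-embedding-3-small
voyage_vec = np.zeros(1024)   # voyage-3.5
sbert_vec = np.zeros(1024)    # Sentence-Transformers

# Unified feature vector fed to a single classifier
combined = np.concatenate([openai_vec, voyage_vec, sbert_vec])
assert combined.shape == (3584,)
```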

<p>As in the previous experiment, the XGBoost, Random Forest, and MLP algorithms showed the best results. However, none of these algorithms surpassed the effectiveness of our approach, which outperformed all the methods and algorithms we tested during these experiments.</p>
</sec>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Discussion</title>
<p>The experimental results demonstrated that integrating OpenAI, VoyageAI, and Sentence-Transformers embeddings with MLP classifiers in an ensemble method based on weighted voting consistently improves classification performance. In <xref ref-type="table" rid="table-4">Table 4</xref>, we present a performance comparison between the model proposed in the current paper and the results reported in previous studies. Despite the absence of a unified benchmarking dataset for ransomware detection, this comparison remains informative. It allows us to position our method relative to existing research and to assess whether it can effectively compete with or surpass state-of-the-art techniques. The results show that our approach achieves superior performance, demonstrating the robustness of our LLM-based ensemble strategy compared with traditional models, even when contrasted with diverse methodologies evaluated on different datasets.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Performance comparison with related works.</title>
</caption>
<table>
<colgroup>
<col align="center" width="35mm"/>
<col align="center" width="20mm"/>
<col align="center" width="50mm"/>
<col align="center" width="25mm"/> </colgroup>
<thead>
<tr>
<th>Paper Reference</th>
<th>Year</th>
<th>Used Methods</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>Khammas [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td>2020</td>
<td>Frequent pattern mining, gain ratio, random forest</td>
<td>97.74%</td>
</tr>
<tr>
<td>Moreira et al. [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>2023</td>
<td>Xception Convolutional Neural Network</td>
<td>93.73% and 98.20%</td>
</tr>
<tr>
<td>Al-rimy et al. [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>2019</td>
<td>Incremental bagging, semi-random subspace selection, ensemble learning</td>
<td>97.89%</td>
</tr>
<tr>
<td>Urooj et al. [<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td>2023</td>
<td>TF-IDF, Generative Adversarial Networks, mutual information</td>
<td>97%</td>
</tr>
<tr>
<td>Cen et al. [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>2025</td>
<td>Occurrence of Words (OoW), Bag of Words (BoW), Sequence of Words (SoW), Recurrent Neural Network (RNN).</td>
<td>94.26%</td>
</tr>
<tr>
<td>Zhou et al. [<xref ref-type="bibr" rid="ref-25">25</xref>]</td>
<td>2025</td>
<td>LLM-assisted task-adaptive pre-training</td>
<td>95.5%</td>
</tr>
<tr>
<td>Our approach</td>
<td>2025</td>
<td>LLM embedding, MLP, Ensemble voting</td>
<td><bold>98.67%</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The analysis of <xref ref-type="table" rid="table-3">Table 3</xref> allowed us to conclude that VoyageAI performs better in terms of accuracy, recall, and F1-score, while OpenAI offers better precision. This suggests that OpenAI produces fewer false positives, while VoyageAI produces fewer false negatives. These findings highlight the advantage of leveraging multiple embedding models, as each representation encodes distinct semantic and contextual information about the input data. By combining these complementary representations, the ensemble method can reduce dependence on a single embedding space and thus mitigate the risk of systematic errors inherent in any one model. The weighted voting technique also contributed to the overall performance of the approach by assigning greater influence to base models with higher predictive reliability.</p>

<p>Although this performance improvement may seem slight when compared to the base VoyageAI model or the XGBoost model using vectorized features, it is still meaningful. In high-stakes domains such as ransomware detection, even small gains in recall and accuracy can have a significant impact by reducing the risk of undetected threats.</p>
<p>Another characteristic that we identified during experimental testing of the proposed approach is the high computational cost caused by the ensemble architecture. Deploying three learning models together requires more hardware resources and can take more time to generate results. This aspect must be considered in applications that demand real-time responses or when deploying on devices with limited resources. Future work should investigate more efficient ensembling or embedding techniques to balance performance with hardware utilization.</p>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this article, we proposed a new approach for ransomware detection based on ensemble learning and state-of-the-art large language models (LLMs). Our first goal was to leverage the embedding capabilities of recent LLMs to generate rich and meaningful representations of executable files. The second goal was to build a robust and accurate detection model based on ensemble learning techniques. The proposed approach combines three base MLP classifiers associated with three different embedding models (OpenAI, VoyageAI, and Sentence-Transformers) with an optimized weighted voting technique to reach a final decision on the input sample. Experimental results showed the effectiveness of our model, which exceeds the performance of the individual base models, achieving an accuracy of 98.67%, a precision of 97.77%, a recall of 96.00%, and an F1-score of 96.88%.</p>
<p>In the future, we plan to enrich the learning data by integrating more dynamic features that accurately reflect the program&#x2019;s behavior during execution. We also envisage streamlining the proposed system by designing new embedding techniques that are both effective and less resource-intensive.</p>
</sec>
</body>
<back>
<ack>
<p>The authors extend their appreciation to the Deanship of Graduate Studies and Scientific Research at Jouf University for funding this work.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was funded by the Deanship of Graduate Studies and Scientific Research at Jouf University under grant No. (DGSSR-2024-02-01176).</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Study conception and design: Abdallah Ghourabi and Hassen Chouaib; data collection: Abdallah Ghourabi; analysis and interpretation of results: Abdallah Ghourabi; draft manuscript preparation: Abdallah Ghourabi and Hassen Chouaib. All authors reviewed and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are available from the Corresponding Author, Abdallah Ghourabi, upon reasonable request.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yan</surname> <given-names>P</given-names></string-name>, <string-name><surname>Talaei Khoei</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Securing the internet of things: a comprehensive review of ransomware attacks, detection, countermeasures, and future prospects</article-title>. <source>Franklin Open</source>. <year>2025</year>;<volume>11</volume>:<fpage>100256</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.fraope.2025.100256</pub-id>.</mixed-citation></ref> 
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Beaman</surname> <given-names>C</given-names></string-name>, <string-name><surname>Barkworth</surname> <given-names>A</given-names></string-name>, <string-name><surname>Akande</surname> <given-names>TD</given-names></string-name>, <string-name><surname>Hakak</surname> <given-names>S</given-names></string-name>, <string-name><surname>Khan</surname> <given-names>MK</given-names></string-name></person-group>. <article-title>Ransomware: recent advances, analysis, challenges and future research directions</article-title>. <source>Comput Secur</source>. <year>2021</year>;<volume>111</volume>:<fpage>102490</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cose.2021.102490</pub-id>; <pub-id pub-id-type="pmid">34602684</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Statista</collab></person-group>. <article-title>Total annual amount of money received by ransomware actors worldwide from 2017 to 2024</article-title>. <year>2025 [Internet]</year>. <comment>[cited 2025 Oct 1]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://www.statista.com/statistics/1410498/ransomware-revenue-annual/">https://www.statista.com/statistics/1410498/ransomware-revenue-annual/</ext-link>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Adams</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Learning the lessons of WannaCry</article-title>. <source>Comput Fraud Secur</source>. <year>2018</year>;<volume>2018</volume>(<issue>9</issue>):<fpage>6</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1016/S1361-3723(18)30084-8</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Sangaiah</surname> <given-names>AK</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Ransomware classification using patch-based CNN and self-attention network on embedded N-grams of opcodes</article-title>. <source>Future Gener Comput Syst</source>. <year>2020</year>;<volume>110</volume>:<fpage>708</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.future.2019.09.025</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Al-rimy</surname> <given-names>BAS</given-names></string-name>, <string-name><surname>Maarof</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Shaid</surname> <given-names>SZM</given-names></string-name></person-group>. <article-title>Crypto-ransomware early detection model using novel incremental bagging with enhanced semi-random subspace selection</article-title>. <source>Future Gener Comput Syst</source>. <year>2019</year>;<volume>101</volume>:<fpage>476</fpage>&#x2013;<lpage>91</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.future.2019.06.005</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Urooj</surname> <given-names>U</given-names></string-name>, <string-name><surname>Al-Rimy</surname> <given-names>BAS</given-names></string-name>, <string-name><surname>Zainal</surname> <given-names>AB</given-names></string-name>, <string-name><surname>Saeed</surname> <given-names>F</given-names></string-name>, <string-name><surname>Abdelmaboud</surname> <given-names>A</given-names></string-name>, <string-name><surname>Nagmeldin</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Addressing behavioral drift in ransomware early detection through weighted generative adversarial networks</article-title>. <source>IEEE Access</source>. <year>2024</year>;<volume>12</volume>:<fpage>3910</fpage>&#x2013;<lpage>25</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2023.3348451</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Cen</surname> <given-names>M</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Doss</surname> <given-names>R</given-names></string-name></person-group>. <article-title>RansoGuard: a RNN-based framework leveraging pre-attack sensitive APIs for early ransomware detection</article-title>. <source>Comput Secur</source>. <year>2025</year>;<volume>150</volume>(<issue>1</issue>):<fpage>104293</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cose.2024.104293</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>S</given-names></string-name>, <string-name><surname>Shu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Nie</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Android malware detection method based on CNN and DNN bybrid mechanism</article-title>. <source>IEEE Trans Ind Inform</source>. <year>2024</year>;<volume>20</volume>(<issue>5</issue>):<fpage>7744</fpage>&#x2013;<lpage>53</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TII.2024.3363016</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Moreira</surname> <given-names>CC</given-names></string-name>, <string-name><surname>Moreira</surname> <given-names>DC</given-names></string-name>, <string-name><surname>Sales</surname> <given-names>CDSD</given-names> <suffix>Jr</suffix></string-name></person-group>. <article-title>Improving ransomware detection based on portable executable header using xception convolutional neural network</article-title>. <source>Comput Secur</source>. <year>2023</year>;<volume>130</volume>(<issue>18</issue>):<fpage>103265</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cose.2023.103265</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Berrueta</surname> <given-names>E</given-names></string-name>, <string-name><surname>Morato</surname> <given-names>D</given-names></string-name>, <string-name><surname>Maga&#x00F1;a</surname> <given-names>E</given-names></string-name>, <string-name><surname>Izal</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Crypto-ransomware detection using machine learning models in file-sharing network scenarios with encrypted traffic</article-title>. <source>Expert Syst Appl</source>. <year>2022</year>;<volume>209</volume>:<fpage>118299</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2022.118299</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ferrag</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Alwahedi</surname> <given-names>F</given-names></string-name>, <string-name><surname>Battah</surname> <given-names>A</given-names></string-name>, <string-name><surname>Cherif</surname> <given-names>B</given-names></string-name>, <string-name><surname>Mechri</surname> <given-names>A</given-names></string-name>, <string-name><surname>Tihanyi</surname> <given-names>N</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Generative AI in cybersecurity: a comprehensive review of LLM applications and vulnerabilities</article-title>. <source>Internet Things Cyber-Phys Syst</source>. <year>2025</year>;<volume>5</volume>:<fpage>1</fpage>&#x2013;<lpage>46</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.iotcps.2025.01.001</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>RTIDS: a robust transformer-based approach for intrusion detection system</article-title>. <source>IEEE Access</source>. <year>2022</year>;<volume>10</volume>:<fpage>64375</fpage>&#x2013;<lpage>87</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2022.3182333</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kheddar</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Transformers and large language models for efficient intrusion detection systems: a comprehensive survey</article-title>. <source>Inf Fusion</source>. <year>2025</year>;<volume>124</volume>(<issue>1</issue>):<fpage>103347</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.inffus.2025.103347</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>De La Torre Parra</surname> <given-names>G</given-names></string-name>, <string-name><surname>Selvera</surname> <given-names>L</given-names></string-name>, <string-name><surname>Khoury</surname> <given-names>J</given-names></string-name>, <string-name><surname>Irizarry</surname> <given-names>H</given-names></string-name>, <string-name><surname>Bou-Harb</surname> <given-names>E</given-names></string-name>, <string-name><surname>Rad</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Interpretable federated transformer log learning for cloud threat forensics</article-title>. In: <conf-name>Proceedings 2022 Network and Distributed System Security Symposium, NDSS 2022</conf-name>. <publisher-loc>Reston, VA, USA</publisher-loc>: <publisher-name>Internet Society</publisher-name>; <year>2022</year>. doi:<pub-id pub-id-type="doi">10.14722/ndss.2022.23102</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ghourabi</surname> <given-names>A</given-names></string-name></person-group>. <article-title>A security model based on LightGBM and transformer to protect healthcare systems from cyberattacks</article-title>. <source>IEEE Access</source>. <year>2022</year>;<volume>10</volume>:<fpage>48890</fpage>&#x2013;<lpage>903</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2022.3172432</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Evangelatos</surname> <given-names>P</given-names></string-name>, <string-name><surname>Iliou</surname> <given-names>C</given-names></string-name>, <string-name><surname>Mavropoulos</surname> <given-names>T</given-names></string-name>, <string-name><surname>Apostolou</surname> <given-names>K</given-names></string-name>, <string-name><surname>Tsikrika</surname> <given-names>T</given-names></string-name>, <string-name><surname>Vrochidis</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Named entity recognition in cyber threat intelligence using transformer-based models</article-title>. In: <conf-name>2021 IEEE International Conference on Cyber Security and Resilience (CSR)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>348</fpage>&#x2013;<lpage>53</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CSR51186.2021.9527981</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Demirkiran</surname> <given-names>F</given-names></string-name>, <string-name><surname>&#x000C7;ayır</surname> <given-names>A</given-names></string-name>, <string-name><surname>&#x00DC;nal</surname> <given-names>U</given-names></string-name>, <string-name><surname>Da&#x0011F;</surname> <given-names>H</given-names></string-name></person-group>. <article-title>An ensemble of pre-trained transformer models for imbalanced multiclass malware classification</article-title>. <source>Comput Secur</source>. <year>2022</year>;<volume>121</volume>:<fpage>102846</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cose.2022.102846</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Patsakis</surname> <given-names>C</given-names></string-name>, <string-name><surname>Casino</surname> <given-names>F</given-names></string-name>, <string-name><surname>Lykousas</surname> <given-names>N</given-names></string-name></person-group>. <article-title>Assessing LLMs in malicious code deobfuscation of real-world malware campaigns</article-title>. <source>Expert Syst Appl</source>. <year>2024</year>;<volume>256</volume>:<fpage>124912</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2024.124912</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jamal</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wimmer</surname> <given-names>H</given-names></string-name>, <string-name><surname>Sarker</surname> <given-names>IH</given-names></string-name></person-group>. <article-title>An improved transformer-based model for detecting phishing, spam and ham emails: a large language model approach</article-title>. <source>Secur Priv</source>. <year>2024</year>;<volume>7</volume>(<issue>5</issue>):<fpage>e402</fpage>. doi:<pub-id pub-id-type="doi">10.1002/spy2.402</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ghourabi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Alohaly</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Enhancing spam message classification and detection using transformer-based embedding and ensemble learning</article-title>. <source>Sensors</source>. <year>2023</year>;<volume>23</volume>(<issue>8</issue>):<fpage>3861</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s23083861</pub-id>; <pub-id pub-id-type="pmid">37112202</pub-id></mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Khammas</surname> <given-names>BM</given-names></string-name></person-group>. <article-title>Ransomware detection using random forest technique</article-title>. <source>ICT Express</source>. <year>2020</year>;<volume>6</volume>(<issue>4</issue>):<fpage>325</fpage>&#x2013;<lpage>31</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.icte.2020.11.001</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yamany</surname> <given-names>B</given-names></string-name>, <string-name><surname>Elsayed</surname> <given-names>MS</given-names></string-name>, <string-name><surname>Jurcut</surname> <given-names>AD</given-names></string-name>, <string-name><surname>Abdelbaki</surname> <given-names>N</given-names></string-name>, <string-name><surname>Azer</surname> <given-names>MA</given-names></string-name></person-group>. <article-title>A new scheme for ransomware classification and clustering using static features</article-title>. <source>Electronics</source>. <year>2022</year>;<volume>11</volume>(<issue>20</issue>):<fpage>3307</fpage>. doi:<pub-id pub-id-type="doi">10.3390/electronics11203307</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hossain</surname> <given-names>AA</given-names></string-name>, <string-name><surname>PK</surname> <given-names>MK</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Amsaad</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Malicious code detection using LLM</article-title>. In: <conf-name>NAECON 2024&#x2014;IEEE National Aerospace and Electronics Conference</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2024</year>. p. <fpage>414</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1109/NAECON61878.2024.10670668</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>W</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yao</surname> <given-names>F</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>SRDC: semantics-based ransomware detection and classification with LLM-assisted pre-training</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2025</year>;<volume>39</volume>(<issue>27</issue>):<fpage>28566</fpage>&#x2013;<lpage>74</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v39i27.35080</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Feng</surname> <given-names>R</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Monjurul Karim</surname> <given-names>M</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>LLM-MalDetect: a large language model-based method for android malware detection</article-title>. <source>IEEE Access</source>. <year>2025</year>;<volume>13</volume>:<fpage>81347</fpage>&#x2013;<lpage>64</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2025.3565526</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Berrueta</surname> <given-names>E</given-names></string-name></person-group>. <article-title>Ransomware and user samples for training and validating ML models</article-title>. <comment>Mendeley Data</comment>. <year>2020 [Internet]</year>. <comment>[cited 2025 Oct 1]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://data.mendeley.com/datasets/yhg5wk39kf/1">https://data.mendeley.com/datasets/yhg5wk39kf/1</ext-link>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Al-Rimy</surname> <given-names>BAS</given-names></string-name>, <string-name><surname>Maarof</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Alazab</surname> <given-names>M</given-names></string-name>, <string-name><surname>Alsolami</surname> <given-names>F</given-names></string-name>, <string-name><surname>Shaid</surname> <given-names>SZM</given-names></string-name>, <string-name><surname>Ghaleb</surname> <given-names>FA</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A pseudo feedback-based annotated TF-IDF technique for dynamic crypto-ransomware pre-encryption boundary delineation and features extraction</article-title>. <source>IEEE Access</source>. <year>2020</year>;<volume>8</volume>:<fpage>140586</fpage>&#x2013;<lpage>98</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2020.3012674</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Sgandurra</surname> <given-names>D</given-names></string-name>, <string-name><surname>Mu&#x00F1;oz-Gonz&#x00E1;lez</surname> <given-names>L</given-names></string-name>, <string-name><surname>Mohsen</surname> <given-names>R</given-names></string-name>, <string-name><surname>Lupu</surname> <given-names>EC</given-names></string-name></person-group>. <article-title>Automated dynamic analysis of ransomware: benefits, limitations and use for detection</article-title>. <comment>arXiv:1609.03020</comment>. <year>2016</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Shafiq</surname> <given-names>MZ</given-names></string-name>, <string-name><surname>Tabish</surname> <given-names>SM</given-names></string-name>, <string-name><surname>Mirza</surname> <given-names>F</given-names></string-name>, <string-name><surname>Farooq</surname> <given-names>M</given-names></string-name></person-group>. <source>PE-Miner: mining structural information to detect malicious executables in realtime</source>. <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2009</year>. p. <fpage>121</fpage>&#x2013;<lpage>41</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-642-04342-0_7</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Martins</surname> <given-names>E</given-names></string-name>, <string-name><surname>Sant&#x2019;Ana</surname> <given-names>R</given-names></string-name>, <string-name><surname>Higuera</surname> <given-names>JRB</given-names></string-name>, <string-name><surname>Montalvo</surname> <given-names>JAS</given-names></string-name>, <string-name><surname>Higuera</surname> <given-names>JB</given-names></string-name>, <string-name><surname>Castillo</surname> <given-names>DP</given-names></string-name></person-group>. <article-title>Semantic malware classification using artificial intelligence techniques</article-title>. <source>Comput Model Eng Sci</source>. <year>2025</year>;<volume>142</volume>(<issue>3</issue>):<fpage>3031</fpage>&#x2013;<lpage>67</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmes.2025.061080</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Sikorski</surname> <given-names>M</given-names></string-name>, <string-name><surname>Honig</surname> <given-names>A</given-names></string-name></person-group>. <source>Practical malware analysis: the hands-on guide to dissecting malicious software</source>. <publisher-loc>San Francisco, CA, USA</publisher-loc>: <publisher-name>No Starch Press</publisher-name>; <year>2012</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Nie</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Long</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>When text embedding meets large language model: a comprehensive survey</article-title>. <comment>arXiv:2412.09165</comment>. <year>2024</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Neelakantan</surname> <given-names>A</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>T</given-names></string-name>, <string-name><surname>Puri</surname> <given-names>R</given-names></string-name>, <string-name><surname>Radford</surname> <given-names>A</given-names></string-name>, <string-name><surname>Han</surname> <given-names>JM</given-names></string-name>, <string-name><surname>Tworek</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Text and code embeddings by contrastive pre-training</article-title>. <comment>arXiv:2201.10005</comment>. <year>2022</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>OpenAI</collab></person-group>. <article-title>New embedding models and API updates</article-title>. <year>2024 [Internet]</year>. <comment>[cited 2025 Oct 1]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://openai.com/index/new-embedding-models-and-api-updates/">https://openai.com/index/new-embedding-models-and-api-updates/</ext-link>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>VoyageAI</collab></person-group>. <article-title>voyage-3.5 and voyage-3.5-lite: improved quality for a new retrieval frontier</article-title>. <year>2025 [Internet]</year>. <comment>[cited 2025 Oct 1]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://blog.voyageai.com/2025/05/20/voyage-3-5/">https://blog.voyageai.com/2025/05/20/voyage-3-5/</ext-link>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Reimers</surname> <given-names>N</given-names></string-name>, <string-name><surname>Gurevych</surname> <given-names>I</given-names></string-name></person-group>. <article-title>Sentence-BERT: sentence embeddings using siamese BERT-networks</article-title>. In: <conf-name>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</conf-name>. <publisher-loc>Wierden, The Netherlands</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>; <year>2019</year>. p. <fpage>3982</fpage>&#x2013;<lpage>92</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/D19-1410</pub-id>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chan</surname> <given-names>KY</given-names></string-name>, <string-name><surname>Abu-Salih</surname> <given-names>B</given-names></string-name>, <string-name><surname>Qaddoura</surname> <given-names>R</given-names></string-name>, <string-name><surname>Al-Zoubi</surname> <given-names>AM</given-names></string-name>, <string-name><surname>Palade</surname> <given-names>V</given-names></string-name>, <string-name><surname>Pham</surname> <given-names>DS</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Deep neural networks in the cloud: review, applications, challenges and research directions</article-title>. <source>Neurocomputing</source>. <year>2023</year>;<volume>545</volume>(<issue>1</issue>):<fpage>126327</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.neucom.2023.126327</pub-id>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>ZH</given-names></string-name></person-group>. <chapter-title>Ensemble learning</chapter-title>. In: <source>Encyclopedia of biometrics</source>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2015</year>. p. <fpage>411</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-1-4899-7488-4_293</pub-id>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>ZH</given-names></string-name></person-group>. <source>Ensemble methods: foundations and algorithms</source>. <publisher-loc>Boca Raton, FL, USA</publisher-loc>: <publisher-name>Chapman and Hall/CRC</publisher-name>; <year>2012</year>. doi:<pub-id pub-id-type="doi">10.1201/b12207</pub-id>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Brochu</surname> <given-names>E</given-names></string-name>, <string-name><surname>Cora</surname> <given-names>VM</given-names></string-name>, <string-name><surname>de Freitas</surname> <given-names>N</given-names></string-name></person-group>. <article-title>A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning</article-title>. <comment>arXiv:1012.2599</comment>. <year>2010</year>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ghourabi</surname> <given-names>A</given-names></string-name></person-group>. <article-title>An attention-based approach to enhance the detection and classification of android malware</article-title>. <source>Comput Mater Contin</source>. <year>2024</year>;<volume>80</volume>(<issue>2</issue>):<fpage>2743</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2024.053163</pub-id>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Anderson</surname> <given-names>HS</given-names></string-name>, <string-name><surname>Roth</surname> <given-names>P</given-names></string-name></person-group>. <article-title>EMBER: an open dataset for training static PE malware machine learning models</article-title>. <comment>arXiv:1804.04637</comment>. <year>2018</year>.</mixed-citation></ref>
</ref-list>
</back></article>