<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">28058</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.028058</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Deep Learning-Based Program-Wide Binary Code Similarity for Smart Contracts</article-title>
<alt-title alt-title-type="left-running-head">Deep Learning-Based Program-Wide Binary Code Similarity for Smart Contracts</alt-title>
<alt-title alt-title-type="right-running-head">Deep Learning-Based Program-Wide Binary Code Similarity for Smart Contracts</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zhuang</surname><given-names>Yuan</given-names>
</name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Baobao</given-names>
</name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Sun</surname><given-names>Jianguo</given-names>
</name><xref ref-type="aff" rid="aff-2">2</xref><email>sunjianguo@hrbeu.edu.cn</email></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Liu</surname><given-names>Haoyang</given-names>
</name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Yang</surname><given-names>Shuqi</given-names>
</name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Da</surname><given-names>Qingan</given-names>
</name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Harbin Engineering University</institution>, <addr-line>Harbin, 150000</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>University of Sanya</institution>, <addr-line>Sanya, 572000</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>University of Alberta</institution>, <addr-line>Edmonton, T5J4P6</addr-line>, <country>Canada</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jianguo Sun. Email: <email>sunjianguo@hrbeu.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2022-08-16"><day>16</day>
<month>08</month>
<year>2022</year></pub-date>
<volume>74</volume>
<issue>1</issue>
<fpage>1011</fpage>
<lpage>1024</lpage>
<history>
<date date-type="received">
<day>01</day>
<month>2</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>5</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Zhuang et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Zhuang et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_28058.pdf"></self-uri>
<abstract>
<p>Recently, security issues of smart contracts have attracted great attention due to the enormous financial losses caused by vulnerability attacks. With the increase of critical security issues in smart contracts, there is a growing need to detect similar code for vulnerability hunting. Binary similarity detection, which quantitatively measures the difference between given pieces of code, has been widely adopted to facilitate critical security analysis. However, due to the differences between common programs and smart contracts, such as the diversity of bytecode generation and the high homogeneity of contract code, directly applying existing graph matching and machine learning based techniques to smart contracts suffers from low accuracy, poor scalability, and the limitation of function-level binary similarity. Therefore, this paper investigates graph neural networks to detect smart contract binary code similarity at the program level, where we conduct instruction-level normalization to reduce noise code during smart contract pre-processing and construct contract control flow graphs to represent smart contracts. In particular, two improved models, Graph Convolutional Network (GCN) and Message Passing Neural Network (MPNN), are explored to encode the contract graphs into quantitative vectors, which capture the semantic information and the program-wide control flow information with temporal order. We can then efficiently accomplish similarity detection by measuring the distance between two targeted contract embeddings. To evaluate the effectiveness and efficiency of our proposed method, extensive experiments are performed on two real-world datasets, i.e., smart contracts from the Ethereum and Enterprise Operation System (EOS) blockchain platforms. The results show that our proposed approach outperforms three state-of-the-art methods by a large margin, achieving improvements in accuracy of up to 6.1% and 17.06%, respectively.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Smart contract</kwd>
<kwd>similarity detection</kwd>
<kwd>neural network</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>As one of the most successful applications of blockchain technology, smart contracts enable people to make agreements while minimizing trust, and they are deployed in various decentralized applications [<xref ref-type="bibr" rid="ref-1">1</xref>]. However, many critical security vulnerabilities within smart contracts on the Ethereum platform have caused huge financial losses to users [<xref ref-type="bibr" rid="ref-2">2</xref>]. Hence, the security analysis of smart contracts has become a new trend in academic research [<xref ref-type="bibr" rid="ref-3">3</xref>&#x2013;<xref ref-type="bibr" rid="ref-6">6</xref>]. Binary code similarity analysis (BCSA) quantitatively measures the similarity between two or more pieces of binary code and has been widely adopted in diverse security applications such as plagiarism detection [<xref ref-type="bibr" rid="ref-7">7</xref>], malware detection [<xref ref-type="bibr" rid="ref-8">8</xref>], and vulnerability discovery [<xref ref-type="bibr" rid="ref-9">9</xref>]. Comparing binary code is especially fundamental for smart contracts, where the contract source code is often not available. For instance, only about 2 percent of the top 1.5 million smart contracts deployed on the blockchain disclose their source code on the Ethereum browser Etherscan [<xref ref-type="bibr" rid="ref-10">10</xref>].</p>
<p>Conventional BCSA approaches mainly aim at detecting the similarity between binary functions [<xref ref-type="bibr" rid="ref-11">11</xref>], such as raw feature-based bug search [<xref ref-type="bibr" rid="ref-12">12</xref>], but they are unable to deal with the opcode reordering issue caused by different compilations. Recently, graph embedding-based methods have been proposed to solve BCSA, since machine learning has shown great success and led to promising results in program analysis [<xref ref-type="bibr" rid="ref-13">13</xref>].</p>
<p>Despite the surging research interest in BCSA, it is significantly challenging to perform new research on smart contracts for several reasons: (1) the high reusability of smart contracts prevents existing methods from being directly applied to smart contract binary code; (2) prior works mostly focus on solving BCSA issues at the token or function level, which is less applicable to the desired scenarios in smart contracts.</p>
<p>To solve the above problem, this paper proposes a neural network-based binary similarity detection on smart contract. We construct the binary code of smart contract as a program-wide Control Flow Graph (CFG) and employ graph neural networks to learn the contract representation. Particularly, we explore the improved GCN and MPNN models to capture the semantic information and the program-wide control flow information with temporal orders, leading to encouraging results of binary similarity detection in smart contract.</p>
<p>In conclusion, we summarize our contributions as follows:
<list list-type="bullet">
<list-item>
<p>We propose an end-to-end method based on graph neural network to solve the program-wide binary code similarity detection in smart contracts.</p></list-item>
<list-item>
<p>We propose a temporal graph neural network to learn the contract representation for similarity detection, which explicitly captures both the semantic and the temporal information to generate graph embeddings.</p></list-item>
<list-item>
<p>We conduct extensive experiments on two real-world smart contract datasets, and the results demonstrate that our approach outperforms state-of-the-art methods in both accuracy and efficiency.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>In this section, we briefly discuss the related work on code similarity, which has been widely applied to bug search, plagiarism detection and vulnerability discovery [<xref ref-type="bibr" rid="ref-14">14</xref>].</p>
<p>In learning-based bug search [<xref ref-type="bibr" rid="ref-15">15</xref>], many researchers have worked on the problem of raw feature-based bug search in binaries and made great progress in this direction. In general, these approaches rely on various raw features extracted directly from binaries for code similarity matching. N-grams and N-perms [<xref ref-type="bibr" rid="ref-16">16</xref>] are two early approaches that match binary sequences or mnemonic codes without understanding the semantics of the code, so they cannot solve the opcode reordering issue caused by different compilations. To further improve accuracy, the tracelet-based approach [<xref ref-type="bibr" rid="ref-17">17</xref>] captures execution sequences as features for code similarity checking, which solves the problem of opcode changes. Tree Edit Distance Based Equational Matching (TEDEM) [<xref ref-type="bibr" rid="ref-18">18</xref>] captures semantics using an expression tree for each basic block. However, opcode and register names differ across architectures, so these two approaches are not suitable for finding bugs across architectures. Graph embedding in graph analysis has two different meanings. The first is to embed the nodes of a graph, i.e., to find a map from the nodes to a vector space such that the structural information of the graph is preserved. In recent years, more and more works have adopted deep learning-based methods to process large-scale graph datasets.</p>
<p>Another research line of graph embedding, explored in this paper, is to learn vectors that represent the entire graph, from traditional image processing [<xref ref-type="bibr" rid="ref-19">19</xref>&#x2013;<xref ref-type="bibr" rid="ref-22">22</xref>] to program analysis [<xref ref-type="bibr" rid="ref-23">23</xref>]. Inspired by this, more researchers apply machine learning methods to handle tasks such as protein design and graph analysis [<xref ref-type="bibr" rid="ref-24">24</xref>]. Currently, the kernel method is a widely used technique for processing structured data such as sequences and graphs.</p>
<p>The key to the kernel method is a carefully designed kernel function (a positive semi-definite function between a pair of nodes). For example, [<xref ref-type="bibr" rid="ref-25">25</xref>] counts specific subtree patterns in a graph; [<xref ref-type="bibr" rid="ref-26">26</xref>] counts the appearances of subgraphs of specific sizes, where different structures are counted in a process known as the Weisfeiler-Lehman (WL) algorithm. However, the kernels in these methods are fixed before learning, so the embedding space may suffer from high dimensionality. To overcome this problem, we explore graph neural network-based methods that learn both the graph structure of smart contract CFGs and semantic information extracted from contract features.</p>
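<p>The subtree-counting idea behind the WL relabeling can be illustrated with a minimal pure-Python sketch (toy graphs and labels of our own choosing, not the cited implementations): each node label is repeatedly compressed together with its sorted neighbour labels, and the kernel value is the dot product of the resulting label histograms.</p>

```python
from collections import Counter

def wl_histogram(adj, labels, iterations=2):
    """Collect a histogram of labels over several WL relabeling rounds."""
    hist = Counter(labels.values())
    for _ in range(iterations):
        new_labels = {}
        for node, neigh in adj.items():
            # Compress (own label, sorted neighbour labels) into a new label.
            new_labels[node] = (labels[node],
                                tuple(sorted(labels[n] for n in neigh)))
        labels = new_labels
        hist.update(labels.values())
    return hist

def wl_kernel(hist_a, hist_b):
    """Kernel value = dot product of the two label histograms."""
    return sum(count * hist_b.get(label, 0) for label, count in hist_a.items())

# Two toy graphs over three identically labelled nodes: a path and a triangle.
path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
init = {0: "b", 1: "b", 2: "b"}

h_path = wl_histogram(path, dict(init))
h_tri = wl_histogram(triangle, dict(init))
```

Note that the kernel is fixed once the relabeling scheme is chosen, which is precisely the limitation that motivates learning the embedding instead.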
</sec>
<sec id="s3">
<label>3</label>
<title>Problem Statement</title>
<p><bold>Problem formulation.</bold> Presented with a pair of smart contract binary codes, we focus on designing a fully automated approach that can identify binary similarity at the program level. That is, we predict a label <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> for each smart contract binary pair, denoted by SP, where <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> &#x003D; 1 indicates that the contracts in SP are similar to a certain degree, while <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> &#x003D; 0 indicates that they are not similar. In this paper, we focus on two types of smart contracts.</p>
<p><bold>Ethereum smart contract.</bold> Ethereum is an open-source public blockchain platform with smart contract functionality. It provides a decentralized virtual machine to handle peer-to-peer contracts through its native cryptocurrency. An Ethereum smart contract is not a common contract in the legal sense, but an application executed by the Ethereum Virtual Machine (EVM). These applications can be used to enforce certain predetermined rules. Developers can write smart contracts in Solidity, a high-level programming language [<xref ref-type="bibr" rid="ref-27">27</xref>], which is then compiled into EVM bytecode. For example, the smart contract named Owned in <xref ref-type="fig" rid="fig-1">Fig. 1</xref> provides a function for transferring ownership. After compilation, the contract is converted from source code into its bytecode format. The binary code of the smart contract is then constructed as a program-wide CFG generated by Octopus [<xref ref-type="bibr" rid="ref-28">28</xref>], a security analysis tool that translates bytecode into an assembly representation and control flow graphs, in which the contract graph is represented by block nodes and edges referring to the jump relationships among blocks.</p>
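<p>Octopus performs the actual disassembly and CFG recovery; the simplified sketch below (a hypothetical opcode stream, not Octopus output or its API) illustrates the core idea of splitting EVM-style bytecode into the basic blocks that become CFG nodes: a block begins at a JUMPDEST and ends at a control-transfer instruction.</p>

```python
# Hypothetical opcode stream standing in for disassembled EVM bytecode.
OPS = ["PUSH1", "JUMPI", "PUSH1", "ADD", "JUMP", "JUMPDEST", "SSTORE", "STOP"]

def split_blocks(ops):
    """Split a linear instruction list into basic blocks: a new block
    begins at JUMPDEST and after any control-transfer instruction."""
    terminators = {"JUMP", "JUMPI", "STOP", "RETURN", "REVERT"}
    blocks, current = [], []
    for op in ops:
        if op == "JUMPDEST" and current:
            blocks.append(current)          # jump target starts a new block
            current = []
        current.append(op)
        if op in terminators:
            blocks.append(current)          # control transfer ends the block
            current = []
    if current:
        blocks.append(current)
    return blocks

blocks = split_blocks(OPS)
```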
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>An example of Ethereum source code, bytecode and CFG</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_28058-fig-1.png"/>
</fig>
<p><bold>EOS smart contract.</bold> The Enterprise Operation System (EOS) platform [<xref ref-type="bibr" rid="ref-29">29</xref>] is an open-source public blockchain platform that focuses on the scalability of transaction speed. WebAssembly (WASM) [<xref ref-type="bibr" rid="ref-30">30</xref>] is a binary instruction format for a stack-based virtual machine, adopted by the EOS blockchain platform for better efficiency and reliability. As shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the source code of an EOS smart contract is compiled into WASM bytecode for execution within the WASM Virtual Machine (VM). The Application Binary Interfaces (ABIs) describe the public interfaces through which the smart contract is interacted with. Every EOS smart contract must provide an apply function as the entry function to handle actions. For example, the transfer function of a smart contract is usually used to handle transfer actions related to the contract [<xref ref-type="bibr" rid="ref-31">31</xref>]. The apply function uses the receiver, code, and action input parameters as filters to map actions to the corresponding functions [<xref ref-type="bibr" rid="ref-32">32</xref>].</p>
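<p>The filtering role of the apply function can be modelled with a short Python sketch; the handler table and return strings here are illustrative stand-ins, not the real EOS C++ interface.</p>

```python
def make_apply(handlers):
    """Model of EOS apply() dispatch: the (receiver, code, action) triple
    filters which contract function handles an incoming action."""
    def apply(receiver, code, action):
        # Typical filter: only dispatch actions addressed to this contract.
        if receiver == code and action in handlers:
            return handlers[action]()
        return "ignored"
    return apply

# Illustrative contract with a single transfer handler.
apply_fn = make_apply({"transfer": lambda: "transfer handled"})
```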
</sec>
<sec id="s4">
<label>4</label>
<title>Our Method</title>
<p><bold>Method overview.</bold> The overall workflow of the method includes three stages: (1) the graph generation phase, in which a graph representation is constructed from each targeted smart contract bytecode; (2) the graph embedding phase, in which two improved neural networks, MPNN and GCN, aggregate the information of each node in the CFGs and learn a high-level embedding for each contract graph; and (3) the similarity comparison phase, which calculates the distance between the two embeddings to identify the similarity of each given contract pair.</p>
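<p>Stages (1) and (2) produce one embedding per contract; stage (3) then reduces similarity detection to a distance test. A minimal sketch, assuming a cosine-style similarity and an illustrative threshold (the actual metric and threshold are design choices of the trained model):</p>

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_similar(emb_a, emb_b, threshold=0.5):
    """Stage (3): label a contract pair from its embedding similarity.
    The 0.5 threshold is illustrative, not the paper's tuned value."""
    return 1 if cosine_similarity(emb_a, emb_b) >= threshold else 0
```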
<sec id="s4_1">
<label>4.1</label>
<title>Graph Generation</title>
<p>Our work mainly focuses on the binary code of smart contracts, comparing the similarity of two binary contracts. To this end, we adopt Octopus (a classic security analysis framework for smart contracts) to process smart contracts in bytecode format on Ethereum and WASM format on EOS. Although there are many differences between the binary code of smart contracts on the Ethereum and EOS platforms, the graph representations generated by Octopus for these two binary formats are quite similar. In the graph generation phase, we collect block nodes and edges to construct smart contract CFGs, where the node set contains all basic blocks, each consisting of a sequence of instructions, and the edge set denotes the jump relationships among blocks. Then, we use word2vec to convert the block nodes into vector representations.</p>
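<p>The output of this phase can be pictured as a dictionary of block nodes with feature vectors plus a jump-edge list. The sketch below uses toy blocks and a bag-of-opcodes count as a simple stand-in for the word2vec embedding (word2vec itself requires a trained model):</p>

```python
# Toy basic blocks and jump relations, standing in for Octopus output.
blocks = {
    0: ["PUSH1", "JUMPI"],
    1: ["PUSH1", "ADD", "JUMP"],
    2: ["JUMPDEST", "SSTORE", "STOP"],
}
edges = [(0, 1), (0, 2), (1, 2)]  # jump relationships among blocks

VOCAB = ["PUSH1", "ADD", "JUMPI", "JUMP", "JUMPDEST", "SSTORE", "STOP"]

def block_vector(instrs):
    """Bag-of-opcodes feature vector: a simple stand-in for the word2vec
    embedding used in the paper."""
    return [instrs.count(op) for op in VOCAB]

# The contract CFG: node features plus the jump-edge set.
graph = {
    "nodes": {b: block_vector(instrs) for b, instrs in blocks.items()},
    "edges": edges,
}
```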
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>An example of EOS source code, bytecode and CFG</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_28058-fig-2.png"/>
</fig>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Overall structure of our improved model</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_28058-fig-3.png"/>
</fig>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Embedding Network</title>
<p>Our graph embedding network is inspired by the classic graph neural networks GCN and MPNN. Given a graph pair <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>p</mml:mi><mml:mo>=&#x003C;</mml:mo><mml:mi>g</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x003E;</mml:mo></mml:math></inline-formula> where <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>g</mml:mi><mml:mo>=&#x003C;</mml:mo><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mi>E</mml:mi><mml:mo>&#x003E;</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>V</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>E</mml:mi></mml:math></inline-formula> are the sets of blocks and edges, respectively. The embedding network computes a <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>p</mml:mi></mml:math></inline-formula>-dimensional feature <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> for each block <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>v</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>V</mml:mi></mml:math></inline-formula>, and then the embedding vector <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> of <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>g</mml:mi></mml:math></inline-formula> is computed as an aggregation of these block embeddings.</p>
<sec id="s4_2_1">
<label>4.2.1</label>
<title>Improved GCN</title>
<p>GCN applies convolutional neural networks to graph-structured data, developing a layer-wise propagation rule:</p>
<p><disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover><mml:msup><mml:mi>D</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:msup><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mover><mml:msup><mml:mi>D</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:msup><mml:mo>&#x005E;</mml:mo></mml:mover></mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> &#x003D; A &#x002B; I is the adjacency matrix (A) enhanced with self-loops (I), <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the feature matrix of layer l, and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is a trainable weight matrix. In the equation, the diagonal node degree matrix D&#x005E; is used to normalize <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>.</p>
<p>When the node vectors are output from the hidden layer, the nodes have effectively been encoded. Let <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msubsup><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> be the final hidden state of the <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msup><mml:mi>i</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> node. We may generate the graph representation <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> by</p>
<p><disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:munderover><mml:msubsup><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula>where <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> denotes the number of major nodes.</p>
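<p>Eqs. (1) and (2) can be written out directly in NumPy. The sketch below runs one propagation layer with ReLU standing in for &#x03C3; and fixed illustrative weights (normally trainable), then sums the node states into the graph representation:</p>

```python
import numpy as np

def gcn_layer(A, X, W):
    """One step of Eq. (1): X' = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])                   # self-loops added
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0.0, d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W)

def readout(H):
    """Eq. (2): graph representation = sum of final node hidden states."""
    return H.sum(axis=0)

# Toy 3-node contract CFG (undirected here, for symmetric normalization).
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
X = np.eye(3)                # one-hot initial block features
W = np.full((3, 2), 0.5)     # illustrative weight matrix
g_hat = readout(gcn_layer(A, X, W))
```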
</sec>
<sec id="s4_2_2">
<label>4.2.2</label>
<title>Improved MPNN</title>
<p>MPNN consists of a message propagation phase and a readout phase. In the message propagation phase, MPNN passes information along the edges successively by following their temporal order. Then, MPNN computes a label for the entire graph G by using a readout function, which aggregates the final states of all nodes in G.</p>
<p>Formally, a contract graph is expressed by G &#x003D; {V, E}, where V consists of all major nodes and E contains all edges. Denote E &#x003D; {<inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>,&#x2026;, <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> }, where <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the <italic>k</italic>-th temporal edge.</p>
<p><bold>Message propagation phase.</bold> Messages are passed along the edges, one edge per time step. At time step 0, the hidden state <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msubsup><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> for each node <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>V</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is initialized with the feature of <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>V</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. At time step <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>k</mml:mi></mml:math></inline-formula>, message flows through the <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msup><mml:mi>k</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> temporal edge <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and updates the hidden state of <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>V</mml:mi><mml:mrow><mml:mi>e</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, namely the end node of <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
Particularly, the message mk is computed based on <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the hidden state of the starting node of <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and the edge type <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p><disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where &#x2295; denotes concatenation operation, matrix <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and bias vector <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>b</mml:mi></mml:math></inline-formula> are network parameters. The original message <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> contains information from the starting node of <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and edge <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> itself, which are then transformed into a vector embedding using <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mi>b</mml:mi></mml:math></inline-formula>.</p>
<p>After receiving the message, the end node of <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> updates its hidden state <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> by aggregating information from the incoming message and its previous state. Formally, <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is updated according to:</p>
<p><disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>U</mml:mi><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>Z</mml:mi><mml:msub><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p><disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msubsup><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">x</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>R</mml:mi><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>U</mml:mi><mml:mo>,</mml:mo><mml:mi>Z</mml:mi><mml:mo>,</mml:mo><mml:mi>R</mml:mi></mml:math></inline-formula> are matrices, while <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> are bias vectors.</p>
<p><bold>Readout phase.</bold> After successively traversing all the edges in G, the MPNN computes the graph embedding for G by reading out the final hidden states of all nodes. Letting <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msubsup><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> be the final hidden state of the <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msup><mml:mi>i</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> node, the graph embedding <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is generated by</p>
<p><disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:munderover><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mi mathvariant="italic">h</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>f</mml:mi></mml:math></inline-formula> is a mapping function, e.g., a neural network, and <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> denotes the number of graph nodes.</p>
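The message passing and readout phases above can be sketched in a few lines of NumPy. This is a minimal illustration of Eqs. (4)-(7), not the paper's implementation: the dimension D, the shared parameters W and b across all edges, the edge-list format, and the identity choice for the mapping f are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative choice)

# Randomly initialized stand-ins for the trained parameters of Eqs. (4)-(6).
W = rng.standard_normal((D, D))
b = rng.standard_normal(D)
U = rng.standard_normal((D, D))
Z = rng.standard_normal((D, D))
R = rng.standard_normal((D, D))
b1 = rng.standard_normal(D)
b2 = rng.standard_normal(D)

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

def propagate(edges, h):
    """Traverse the edge list once, updating end-node hidden states.

    edges: list of (start, end, x_k), where x_k packs the information of
    the starting node and of edge e_k itself; h: dict node -> hidden state.
    """
    for start, end, x_k in edges:
        m_k = W @ x_k + b                           # Eq. (4): edge message
        h_hat = np.tanh(U @ m_k + Z @ h[end] + b1)  # Eq. (5): candidate state
        h[end] = softmax(R @ h_hat + b2)            # Eq. (6): updated state
    return h

def readout(h):
    """Eq. (7) with f taken as the identity; f could be a small network."""
    return sum(h.values())
```

One traversal of the edge list corresponds to one round of message passing; after T rounds, `readout` sums the final node states into a single graph embedding.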
</sec>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Similarity Comparison</title>
<p>We use a Siamese architecture to integrate two identical graph embedding networks. The Siamese architecture is popular for tasks that involve measuring the similarity between two comparable objects, and it has been adopted by existing BCSA methods with good results [<xref ref-type="bibr" rid="ref-33">33</xref>]. Each graph embedding network takes a CFG as its input and outputs the graph embedding <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>. The final output of the Siamese architecture is the cosine similarity of the two embedded contracts. In addition, the two embedding networks share the same parameter set, so they remain identical throughout training. Consider a graph pair <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>p</mml:mi><mml:mo>=&#x003C;</mml:mo><mml:mi>g</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x003E;</mml:mo></mml:math></inline-formula> with ground-truth pairing information <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where y &#x003D; 1 indicates that <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>g</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are similar, and <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow></mml:math></inline-formula> &#x003D; &#x2212;1 otherwise.</p>
<p><disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x003C;</mml:mo><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x003E;</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo 
stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>where <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mrow><mml:mover><mml:mi>g</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is the graph embedding representation generated by the improved GCN and MPNN.</p>
<p>To train the parameters of the above models, we optimize the following objective function:</p>
<p><disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></disp-formula></p>
<p>We optimize the objective of Expression <xref ref-type="disp-formula" rid="eqn-9">(9)</xref> with stochastic gradient descent. The gradients of the parameters are calculated recursively according to the graph topology. The training process terminates once the Siamese network achieves good performance.</p>
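For illustration, the objective of Expression (9) can be evaluated over a batch of embedded pairs as below. A framework such as PyTorch would normally supply the gradients for stochastic gradient descent, so this sketch (with our own naming) only computes the loss value.

```python
import numpy as np

def siamese_loss(pairs, labels):
    """Expression (9): sum of squared differences between the predicted
    cosine similarity Sim(g_i, g_i') and the ground-truth label y_i in {1, -1}.

    pairs: list of (g, g_prime) embedding-vector tuples; labels: list of y_i.
    """
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return sum((cos(g, g_prime) - y) ** 2
               for (g, g_prime), y in zip(pairs, labels))
```

A perfectly matched positive pair contributes zero loss, while a positive pair whose embeddings point in opposite directions contributes the maximum per-pair loss of 4.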
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Experiments</title>
<sec id="s5_1">
<label>5.1</label>
<title>Experimental Settings</title>
<p><bold>Datasets.</bold> Extensive experiments are conducted on two datasets of real-world binary contracts collected from the Ethereum and EOS platforms. In particular, we collected the source code files of 44096 Ethereum smart contracts [<xref ref-type="bibr" rid="ref-24">24</xref>], which contain roughly 230452 independent smart contracts. After compilation, disassembly and deduplication, 3250 distinct contracts remain in the Ethereum dataset. For EOS, we collected 3881 real-world smart contracts [<xref ref-type="bibr" rid="ref-34">34</xref>]; after deduplication, 2306 contract binaries remain in the EOS dataset. Then, for each distinct contract, we construct a set of similar contract pairs tagged as positive samples and a set of dissimilar contract pairs tagged as negative samples.</p>
<p><bold>Compared methods.</bold> We compare our proposed approaches (improved GCN and MPNN) with a traditional graph matching method and two deep learning methods. The traditional graph matching method is WL [<xref ref-type="bibr" rid="ref-35">35</xref>], which calculates the structural similarity of two graphs based on subtrees. The two deep learning methods are Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [<xref ref-type="bibr" rid="ref-36">36</xref>] and Gemini [<xref ref-type="bibr" rid="ref-33">33</xref>]. DBSCAN is a density-based spatial clustering algorithm, while Gemini is a neural network-based method that uses Structure2vec to compute the graph embeddings of CFGs and identify whether the binary codes of two traditional high-level programs are similar. For the neural network-based methods, we randomly pick 80% of the contracts from each dataset as the training set, while the remaining contracts are used as the test set.</p>
<p><bold>Metrics.</bold> The comparison involves the classic BCSA metrics of accuracy, recall, precision, and F1 score. These metrics are computed from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The true counts represent correctly predicted results, either true positives or true negatives, while the false counts represent cases where our model gives the wrong output.</p>
<p>The precision metric is the ratio of true positives to all positive predictions, which indicates the reliability of the classifier&#x2019;s positive predictions. The recall metric is the proportion of actual positives that are correctly classified. The formulas for these two metrics are given below:</p>
<p><disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mi mathvariant="italic">P</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">n</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p><disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mrow><mml:mi mathvariant="italic">R</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">l</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>The F1 score is commonly used in information retrieval; it quantifies the overall decision accuracy by combining precision and recall, and is defined as their harmonic mean:</p>
<p><disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mi mathvariant="italic">P</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">n</mml:mi></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mi mathvariant="italic">R</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">l</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="italic">P</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">n</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="italic">R</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">l</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Note that the best and worst values of the F1 score are 1 and 0, respectively. The F1 score can be calculated per class label or globally. In our evaluation, we use the weighted F1 score, in which the per-class F1 scores are weighted by the number of samples in each class.</p>
<p>Finally, the accuracy metric describes the overall effectiveness of our methods, representing the proportion of correctly classified samples:</p>
<p><disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mrow><mml:mi mathvariant="italic">A</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
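The four metrics of Eqs. (10)-(13) follow directly from the TP/TN/FP/FN counts; a minimal helper (our own naming) is:

```python
def bcsa_metrics(tp, tn, fp, fn):
    """Eqs. (10)-(13): precision, recall, F1 and accuracy from raw counts."""
    precision = tp / (tp + fp)                       # Eq. (10)
    recall = tp / (tp + fn)                          # Eq. (11)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (12)
    accuracy = (tp + tn) / (tp + tn + fp + fn)       # Eq. (13)
    return precision, recall, f1, accuracy
```

For the weighted F1 score used in the evaluation, this per-class computation would be repeated for each label and averaged with weights proportional to the class sizes.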
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Results Analysis</title>
<p>Performance comparison in terms of the above metrics is shown in <xref ref-type="table" rid="table-1">Tab. 1</xref>, where we compare the proposed methods (i.e., the improved GCN and MPNN models) with existing approaches on the collected datasets. Meanwhile, we illustrate the effectiveness of our models by evaluating their ROCs in <xref ref-type="fig" rid="fig-4">Figs. 4</xref> and <xref ref-type="fig" rid="fig-5">5</xref>. Since smart contracts on Ethereum and EOS differ considerably in instruction sets and size, we discuss the experimental results for the two datasets separately.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Performance comparison in terms of accuracy, recall, precision and F1 score</title>
</caption>
<table frame="hsides">
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th rowspan="2">Methods</th>
<th colspan="4">Ethereum dataset</th>
<th colspan="4">EOS dataset</th>
</tr>
<tr>
<th>Acc (%)</th>
<th>Recall (%)</th>
<th>Precision (%)</th>
<th>F1 (%)</th>
<th>Acc (%)</th>
<th>Recall (%)</th>
<th>Precision (%)</th>
<th>F1 (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>WL</td>
<td>56.99</td>
<td>15.41</td>
<td>53.82</td>
<td>40.25</td>
<td>50.72</td>
<td>14.39</td>
<td>50.36</td>
<td>33.98</td>
</tr>
<tr>
<td>Gemini</td>
<td>85.66</td>
<td>71.33</td>
<td>77.72</td>
<td>74.92</td>
<td>72.55</td>
<td>56.86</td>
<td>67.16</td>
<td>62.71</td>
</tr>
<tr>
<td>DBSCAN</td>
<td>74.55</td>
<td>63.44</td>
<td>70.09</td>
<td>67.10</td>
<td>75.72</td>
<td>67.41</td>
<td>72.05</td>
<td>69.91</td>
</tr>
<tr>
<td>GCN</td>
<td><bold>80.61</bold></td>
<td><bold>82.12</bold></td>
<td><bold>79.71</bold></td>
<td><bold>80.89</bold></td>
<td><bold>88.77</bold></td>
<td><bold>97.71</bold></td>
<td><bold>82.89</bold></td>
<td><bold>89.69</bold></td>
</tr>
<tr>
<td>MPNN</td>
<td><bold>91.76</bold></td>
<td><bold>93.19</bold></td>
<td><bold>90.59</bold></td>
<td><bold>93.09</bold></td>
<td><bold>89.61</bold></td>
<td><bold>93.55</bold></td>
<td><bold>86.71</bold></td>
<td><bold>93.28</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>ROC analysis for GCN, MPNN on Ethereum dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_28058-fig-4.png"/>
</fig>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>ROC analysis for GCN, MPNN on EOS dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_28058-fig-5.png"/>
</fig>
<sec id="s5_2_1">
<label>5.2.1</label>
<title>Comparison on Ethereum Dataset</title>
<p>Firstly, we compare the improved GCN and MPNN methods with WL, DBSCAN and Gemini on the similarity detection of Ethereum binary contracts. <xref ref-type="table" rid="table-1">Tab. 1</xref> shows the performance of the different methods in terms of accuracy, recall, precision and F1 score.</p>
<p>From the quantitative results in <xref ref-type="table" rid="table-1">Tab. 1</xref>, we make the following observations. First, the traditional method does not achieve satisfactory accuracy in similarity detection; for example, the accuracy of WL is 56.45%. Second, the improved MPNN method achieves a great improvement over the traditional method: MPNN reaches an accuracy of 91.76%, an increase of 35.31% over the traditional method. Third, the improved GCN also obtains better results than the traditional method. These empirical results show that applying graph neural networks to binary contract similarity detection has great potential. We further study the traditional similarity detection tool to explore the reason behind these observations: WL relies heavily on the graph structure while ignoring the semantic information within blocks, which leads to its low accuracy and other metrics.</p>
<p>To verify whether the proposed neural network-based methods can successfully detect the similarity of Ethereum smart contracts, we compare them with a well-known deep learning method in BCSA, i.e., Gemini. Experimental results show that Gemini outperforms the traditional method WL, the clustering method DBSCAN and the improved GCN. This suggests that considering both graphical and semantic information is necessary to excel in similarity detection. Meanwhile, we emphasize that the improved MPNN model achieves the highest scores on all four indicators, since the additional messages passed along edges contribute to the final embeddings of contract CFGs. The ROCs shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref> also illustrate the effectiveness of our two proposed models: the larger the area under the ROC curve, the stronger the similarity detection ability.</p>
</sec>
<sec id="s5_2_2">
<label>5.2.2</label>
<title>Comparison on EOS Dataset</title>
<p>We list the comparison results for binary similarity detection of EOS contracts in <xref ref-type="table" rid="table-1">Tab. 1</xref>. They indicate that the static method fails to identify contract similarity when processing complex smart contracts, with an accuracy of only about 50%. In contrast, our methods are able to deal with complex programs: the results of MPNN change little, while GCN&#x2019;s improve significantly. This confirms that GCN is relatively good at processing complex graph information. The reason is that, when learning a given contract CFG, MPNN and GCN take the semantic information of each node into consideration. This information is then transmitted to neighboring nodes along with the edge messages between nodes, finally yielding a complete graph representation of the CFG that better expresses complex contracts.</p>
<p>Among the neural network methods, Gemini&#x2019;s ability to deal with complex programs is significantly reduced, while the performance of DBSCAN is not influenced by program complexity. At the same time, the two improved models we propose outperform both of these methods, especially when solving the BCSA problem for binary contracts. This is because Gemini targets traditional high-level languages, and smart contracts differ greatly from traditional high-level programs in functionality and implementation. This can also be observed in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, where our models achieve promising ROCs.</p>
</sec>
</sec>
<sec id="s5_3">
<label>5.3</label>
<title>Case Study</title>
<p>The goal of our proposed graph neural network models is to better understand and identify the similarity of a given pair of smart contract binaries. Following <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, we walk through the proposed workflow with an example of two similar smart contracts, which are generated from the same smart contract with different compiler versions.</p>
<p>Understanding binary code is a difficult problem, so we first generate graph representations, i.e., CFGs, of the two binary contracts, as shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>. Basic blocks of different instruction sets are modeled as nodes, and control-flow relationships between blocks are modeled as edges of the CFGs. To clearly show the dissimilar parts of the contract graphs, we use &#x2018;&#x2026;&#x2019; to represent the similar blocks in the CFGs and add serial numbers to distinguish the blocks. Then we utilize word2vec to convert the block nodes into vector representations, which are fed into the graph embedding network. In the graph embedding phase, we exploit the proposed graph neural network (i.e., the well-trained GCN or MPNN model) to encode the input graph representation into a high-level embedding; that is, each graph embedding network takes a CFG as its input and outputs a graph embedding. Lastly, we use the cosine similarity of the two embeddings to detect whether the pair of smart contracts is similar. In this case, the detection result is similar, which demonstrates the accuracy of our proposed method in similarity detection.</p>
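The end-to-end comparison described above can be summarized in a short sketch. Here `build_cfg`, `embed_blocks` and `embed_graph` are hypothetical stand-ins for the disassembly, word2vec and trained GCN/MPNN stages (supplied by the caller), and the 0.5 decision threshold is an illustrative assumption rather than a value from the paper.

```python
import numpy as np

def compare_contracts(bin_a, bin_b, build_cfg, embed_blocks, embed_graph,
                      threshold=0.5):
    """End-to-end similarity check for one pair of contract binaries.

    build_cfg:    binary -> CFG (disassembly stage, hypothetical)
    embed_blocks: CFG -> CFG with word2vec node vectors (hypothetical)
    embed_graph:  vectorized CFG -> graph embedding (trained GCN/MPNN stand-in)
    """
    cfgs = [build_cfg(b) for b in (bin_a, bin_b)]
    g, g2 = (embed_graph(embed_blocks(cfg)) for cfg in cfgs)
    # Cosine similarity of the two graph embeddings, as in Eq. (8).
    score = float(np.dot(g, g2) / (np.linalg.norm(g) * np.linalg.norm(g2)))
    return score, score >= threshold
```

With trained stages plugged in, the returned boolean corresponds to the similar/dissimilar verdict discussed in the case study.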
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Example CFGs of a similar smart contract pair</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_28058-fig-6.png"/>
</fig>
<p>From <xref ref-type="fig" rid="fig-6">Fig. 6</xref>, we can see that both the graph structure of the CFGs and the instruction sets of the basic blocks differ between the two contracts. For example, Block1, block10, and block38, which contain stack-operation instructions such as DUP1 and POP, have little effect on instruction analysis. In contrast, there are more useful instructions in block10. By learning the graph features, i.e., the semantic information of these blocks and the graph structure of the binary contract pair, our proposed model can correctly determine that the given binary files of the contract pair are similar.</p>
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion</title>
<p>In this paper, we proposed a deep learning-based scheme for program-wide binary code similarity of smart contracts, in which improved GCN and MPNN models are exploited for similarity detection of two given binary contracts. We used control-flow graphs (CFGs) to represent the binary code of smart contracts and adopted graph neural networks to generate the graph embeddings. We then employed a Siamese network integrating two identical graph neural networks to calculate the similarity between the two contract encodings. To the best of our knowledge, this is the first work that applies such a similarity detection method to binary contracts. For model training, we built two real-world datasets from two well-known blockchain platforms, Ethereum and EOS, containing 49,725 binary smart contracts in total. Compared with the state-of-the-art methods, we achieved promising similarity detection accuracy: evaluation results show that our method outperforms three state-of-the-art methods, with accuracy improvements of up to 6.1% and 17.06% on the two datasets. We believe this is also an important step toward further study of binary code similarity for smart contracts.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="other"><p><bold>Funding Statement:</bold> This work is supported by the Basic Research Program (No. JCKY2019210B029) and Network threat depth analysis software (KY10800210013).</p>
</fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yin</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>A comprehensive survey on smart contract construction and execution: Paradigms, tools, and systems</article-title>,&#x201D; <source>Patterns</source>, vol. <volume>2</volume>, no. <issue>2</issue>, pp. <fpage>100179</fpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Atzei</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Bartoletti</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Cimoli</surname></string-name></person-group>, &#x201C;<article-title>A survey of attacks on Ethereum smart contracts (SoK)</article-title>,&#x201D; in <conf-name>Int. Conf. on Principles of Security and Trust</conf-name>, <publisher-loc>Heidelberg, Berlin</publisher-loc>, pp. <fpage>164</fpage>&#x2013;<lpage>186</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Han</surname></string-name>, <string-name><given-names>W.</given-names> <surname>You</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Shi</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Liang</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Hunting vulnerable smart contracts via graph embedding based bytecode matching</article-title>,&#x201D; <source>IEEE Transactions on Information Forensics and Security</source>, vol. <volume>16</volume>, pp. <fpage>2144</fpage>&#x2013;<lpage>2156</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>W. K.</given-names> <surname>Chan</surname></string-name></person-group>, &#x201C;<article-title>ContractFuzzer: Fuzzing smart contracts for vulnerability detection</article-title>,&#x201D; in <conf-name>2018 33rd IEEE/ACM Int. Conf. on Automated Software Engineering (ASE)</conf-name>, <publisher-loc>Montpellier, France</publisher-loc>, pp. <fpage>259</fpage>&#x2013;<lpage>269</lpage>, <year>2018</year>. </mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Qian</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhuang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Qiu</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Combining graph neural networks with expert knowledge for smart contract vulnerability detection</article-title>,&#x201D; <source>IEEE Transactions on Knowledge and Data Engineering, Early Access</source>, vol. 2021, pp. <fpage>1</fpage>, <year>2021</year>. <uri xlink:href="https://doi.org/10.1109/TKDE.2021.3095196">https://doi.org/10.1109/TKDE.2021.3095196</uri>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Luu</surname></string-name>, <string-name><given-names>D. H.</given-names> <surname>Chu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Olickel</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Saxena</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Hobor</surname></string-name></person-group>, &#x201C;<article-title>Making smart contracts smarter</article-title>,&#x201D; in <conf-name>Proc. of the 2016 ACM SIGSAC Conf. on Computer and Communications Security</conf-name>, <publisher-loc>Vienna, Austria</publisher-loc>, pp. <fpage>254</fpage>&#x2013;<lpage>269</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Jang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Brumley</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Venkataraman</surname></string-name></person-group>, &#x201C;<article-title>Bitshred: Feature hashing malware for scalable triage and semantic analysis</article-title>,&#x201D; in <conf-name>Proc. of the 18-th ACM Conf. on Computer and Communications Security</conf-name>, <publisher-loc>Chicago, Illinois, USA</publisher-loc>, pp. <fpage>309</fpage>&#x2013;<lpage>320</lpage>, <year>2011</year>. </mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Luo</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ming</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Zhu</surname></string-name></person-group>, &#x201C;<article-title>Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection</article-title>,&#x201D; in <conf-name>Proc. of the 22nd Int. Symp. on Foundations of Software Engineering</conf-name>, <publisher-loc>Hong Kong, China</publisher-loc>, pp. <fpage>389</fpage>&#x2013;<lpage>400</lpage>, <year>2014</year>. </mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Meng</surname></string-name>, <string-name><given-names>B. P.</given-names> <surname>Miller</surname></string-name> and <string-name><given-names>K. S.</given-names> <surname>Jun</surname></string-name></person-group>, &#x201C;<article-title>Identifying multiple authors in a binary program</article-title>,&#x201D; in <conf-name>Proc. of the 22nd European Symp. on Research in Computer Security</conf-name>, <publisher-loc>Oslo, Norway</publisher-loc>, pp. <fpage>286</fpage>&#x2013;<lpage>304</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Etherscan</collab></person-group>, &#x201C;<article-title>The Ethereum block explorer</article-title>,&#x201D; <year>2021</year>. [Online]. Available: <uri>https://etherscan.io/</uri>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>Feng</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Testa</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Scalable graph-based bug search for firmware images</article-title>,&#x201D; in <conf-name>Proc. of the 2016 ACM SIGSAC Conf. on Computer and Communications Security</conf-name>, <publisher-loc>Vienna, Austria</publisher-loc>, pp. <fpage>480</fpage>&#x2013;<lpage>491</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Zuo</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Young</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Luo</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Zeng</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Neural machine translation inspired binary code similarity comparison beyond function pairs</article-title>,&#x201D; in <conf-name>Proc. of the 2019 Network and Distributed System Security Symp.</conf-name>, <publisher-loc>San Diego, California, USA</publisher-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>15</lpage>, <year>2019</year>. </mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Feng</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Guo</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Tang</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Duan</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Feng</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Codebert: A pre-trained model for programming and natural languages</article-title>,&#x201D; in <conf-name>Findings of the Association for Computational Linguistics: EMNLP 2020</conf-name>, Online, pp. <fpage>1536</fpage>&#x2013;<lpage>1547</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Kivinen</surname></string-name>, <string-name><given-names>A. J.</given-names> <surname>Smola</surname></string-name> and <string-name><given-names>R. C.</given-names> <surname>Williamson</surname></string-name></person-group>, &#x201C;<article-title>Online learning with kernels</article-title>,&#x201D; <source>IEEE Transactions on Signal Processing</source>, vol. <volume>52</volume>, no. <issue>8</issue>, pp. <fpage>2165</fpage>&#x2013;<lpage>2176</lpage>, <year>2004</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>Feng</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Testa</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Scalable graph-based bug search for firmware images</article-title>,&#x201D; in <conf-name>ACM Conf. on Computer and Communications Security (CCS&#x2019;16)</conf-name>, <publisher-loc>Vienna, Austria</publisher-loc>, pp. <fpage>480</fpage>&#x2013;<lpage>491</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W. M.</given-names> <surname>Khoo</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Mycroft</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Anderson</surname></string-name></person-group>, &#x201C;<article-title>Rendezvous: A search engine for binary code</article-title>,&#x201D; in <conf-name>2013 10th Working Conf. on Mining Software Repositories (MSR)</conf-name>, <publisher-loc>San Francisco, CA, USA</publisher-loc>, pp. <fpage>329</fpage>&#x2013;<lpage>338</lpage>, <year>2013</year>. </mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>David</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Yahav</surname></string-name></person-group>, &#x201C;<article-title>Tracelet-based code search in executables</article-title>,&#x201D; in <conf-name>Proc. of the 35th ACM SIGPLAN Conf. on Programming Language Design and Implementation</conf-name>, <publisher-loc>New York, NY</publisher-loc>, <volume>49</volume>, pp. <fpage>349</fpage>&#x2013;<lpage>360</lpage>, <year>2014</year>. </mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Pewny</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Schuster</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Bernhard</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Holz</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Rossow</surname></string-name></person-group>, &#x201C;<article-title>Leveraging semantic signatures for bug search in binary programs</article-title>,&#x201D; in <conf-name>Proc. of the 30th Annual Computer Security Applications Conf.</conf-name>, <publisher-loc>New Orleans, Louisiana, USA</publisher-loc>, pp. <fpage>406</fpage>&#x2013;<lpage>415</lpage>, <year>2014</year>. </mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X. R.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>W. F.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>X. M.</given-names> <surname>Sun</surname></string-name> and <string-name><given-names>S. K.</given-names> <surname>Jha</surname></string-name></person-group>, &#x201C;<article-title>A robust 3-D medical watermarking based on wavelet transform for data protection</article-title>,&#x201D; <source>Computer Systems Science &#x0026; Engineering</source>, vol. <volume>41</volume>, no. <issue>3</issue>, pp. <fpage>1043</fpage>&#x2013;<lpage>1056</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X. R.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>X. M.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Sun</surname></string-name> and <string-name><given-names>S. K.</given-names> <surname>Jha</surname></string-name></person-group>, &#x201C;<article-title>Robust reversible audio watermarking scheme for telemedicine and privacy protection</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>71</volume>, no. <issue>2</issue>, pp. <fpage>3035</fpage>&#x2013;<lpage>3050</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>G. Z.</given-names> <surname>Dai</surname></string-name>, <string-name><given-names>X. R.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X. Z.</given-names> <surname>He</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>TBE-Net: A three-branch embedding network with part-aware ability and feature complementary learning for vehicle re-identification</article-title>,&#x201D; <source>IEEE Transactions on Intelligent Transportation Systems</source>, pp. <fpage>1</fpage>&#x2013;<lpage>13</lpage>, <year>2021</year>. <uri xlink:href="https://doi.org/10.1109/TITS.2021.3130403">https://doi.org/10.1109/TITS.2021.3130403</uri>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Dai</surname></string-name>, <string-name><given-names>X. R.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>P. S.</given-names> <surname>Chang</surname></string-name> and <string-name><given-names>X. Z.</given-names> <surname>He</surname></string-name></person-group>, &#x201C;<article-title>RSOD: Real-time small object detection algorithm in UAV-based traffic monitoring</article-title>,&#x201D; <source>Applied Intelligence</source>, vol. <volume>92</volume>, no. <issue>6</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>16</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Zhuang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Qian</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Smart contract vulnerability detection using graph neural network</article-title>,&#x201D; in <conf-name>IJCAI 2020</conf-name>, <publisher-loc>YoKohama, Japan</publisher-loc>, pp. <fpage>3283</fpage>&#x2013;<lpage>3290</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Ling</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Shi</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Multi-head attention graph network for few shot learning</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>68</volume>, no. <issue>2</issue>, pp. <fpage>1505</fpage>&#x2013;<lpage>1517</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Ramon</surname></string-name> and <string-name><given-names>T.</given-names> <surname>G&#x00E4;rtner</surname></string-name></person-group>, &#x201C;<article-title>Expressivity versus efficiency of graph kernels</article-title>,&#x201D; in <conf-name>Proc. of the First Int. Workshop on Mining Graphs, Trees and Sequences</conf-name>, <publisher-loc>Cavtat-Dubrovnik, Croatia</publisher-loc>, pp. <fpage>65</fpage>&#x2013;<lpage>74</lpage>, <year>2003</year>. </mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Shervashidze</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Schweitzer</surname></string-name>, <string-name><given-names>E. J.</given-names> <surname>Leeuwen</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Mehlhorn</surname></string-name> and <string-name><given-names>K. M.</given-names> <surname>Borgwardt</surname></string-name></person-group>, &#x201C;<article-title>Weisfeiler-lehman graph kernels</article-title>,&#x201D; <source>Journal of Machine Learning Research</source>, vol. <volume>12</volume>, no. <issue>9</issue>, pp. <fpage>2539</fpage>&#x2013;<lpage>2561</lpage>, <year>2011</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Solidity</collab></person-group>, <year>2021</year>. [Online]. Available: <uri>https://solidity.readthedocs.io/en/v0.6.4/</uri>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Angelo</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Salzer</surname></string-name></person-group>, &#x201C;<article-title>A survey of tools for analyzing ethereum smart contracts</article-title>,&#x201D; in <conf-name>Proc. of the 2019 IEEE Int. Conf. on Decentralized Applications and Infrastructures (DAPPCON)</conf-name>, <publisher-loc>Berlin, German</publisher-loc>, pp. <fpage>69</fpage>&#x2013;<lpage>78</lpage>, <year>2019</year>. </mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>EOSIO</collab></person-group>, <year>2021</year>. [Online]. Available: <uri>https://eos.io/build-on-eosio/eosio-cdt/</uri>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>WebAssembly</collab></person-group>, <year>2021</year>. [Online]. Available: <uri>https://webassembly.org/</uri>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Transfer function of EOSIO smart contracts</collab></person-group>, <year>2021</year>. [Online]. Available: <uri>https://developers.eos.io</uri>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>EOSIO ABI macro and apply</collab></person-group>, <year>2021</year>. [Online]. Available: <uri>https://developers.eos.io/eosiocpp/v1.2.0/docs/abi</uri>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Feng</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Yin</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Song</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Neural network-based graph embedding for cross-platform binary code similarity detection</article-title>,&#x201D; in <conf-name>Proc. of the 2017 ACM SIGSAC Conf. on Computer and Communications Security</conf-name>, <publisher-loc>Dallas, Texas, USA</publisher-loc>, pp. <fpage>363</fpage>&#x2013;<lpage>376</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Jiang</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Chan</surname></string-name></person-group>, &#x201C;<article-title>EOSFuzzer: Fuzzing EOSIO smart contracts for vulnerability detection</article-title>,&#x201D; in <conf-name>12th Asia-Pacific Symp. on Internetware (Internetware&#x2019;20)</conf-name>, <publisher-loc>New York, USA</publisher-loc>, <publisher-name>Association for Computing Machinery</publisher-name>, pp. <fpage>99</fpage>&#x2013;<lpage>109</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Sugiyama</surname></string-name>, <string-name><given-names>M. E.</given-names> <surname>Ghisu</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Llinares-L&#x00F3;pez</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Borgwardt</surname></string-name></person-group>, &#x201C;<article-title>Graphkernels: R and Python packages for graph comparison</article-title>,&#x201D; <source>Bioinformatics</source>, vol. <volume>34</volume>, no. <issue>3</issue>, pp. <fpage>530</fpage>&#x2013;<lpage>532</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F. G.</given-names> <surname>Yasar</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Ulutagay</surname></string-name></person-group>, &#x201C;<article-title>Challenges and possible solutions to density-based clustering</article-title>,&#x201D; in <conf-name>2016 IEEE 8th Int. Conf. on Intelligent Systems</conf-name>, <publisher-loc>Sofia, Bulgaria</publisher-loc>, pp. <fpage>492</fpage>&#x2013;<lpage>498</lpage>, <year>2016</year>. </mixed-citation></ref>
</ref-list>
</back>
</article>
