<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">22659</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2022.022659</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>SSA-HIAST: A Novel Framework for Code Clone Detection</article-title>
<alt-title alt-title-type="left-running-head">SSA-HIAST: A Novel Framework for Code Clone Detection</alt-title>
<alt-title alt-title-type="right-running-head">SSA-HIAST: A Novel Framework for Code Clone Detection</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Saini</surname><given-names>Neha</given-names></name><email>neha3998akalacademy@gmail.com</email>
</contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Singh</surname><given-names>Sukhdip</given-names></name></contrib>
<aff id="aff-1"><institution>Deenbandhu Chhotu Ram University of Science and Technology</institution>, <addr-line>Murthal, 131001</addr-line>, <country>India</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Neha Saini. Email: <email>neha3998akalacademy@gmail.com</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2021-11-29"><day>29</day>
<month>11</month>
<year>2021</year></pub-date>
<volume>71</volume>
<issue>2</issue>
<fpage>2999</fpage>
<lpage>3017</lpage>
<history>
<date date-type="received"><day>14</day><month>8</month><year>2021</year></date>
<date date-type="accepted"><day>29</day><month>9</month><year>2021</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2022 Saini and Singh</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Saini and Singh</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_22659.pdf"></self-uri>
<abstract>
<p>In modern software development, code reuse is a widely used way to save time. Reuse is commonly done by copying and pasting code, a practice known as code cloning. Cloning leads to problems such as code that is harder to debug and manage and that takes longer to debug. Various algorithms have been developed in the literature to find clones, but they take too much time and space, and most of them are not scalable. This paper targets that problem. The proposed framework introduces a new method of identifying clones that takes less time than many popular code clone detection algorithms. It also addresses one of the key issues in code clone detection, namely the detection of near-miss (Type-3) and semantic (Type-4) clones, with accuracies of 95.52&#x0025; and 92.80&#x0025;, respectively. The study is divided into two phases. The first phase converts any code into an intermediate representation, the hash-inspired abstract syntax tree. In the second phase, these trees are passed to a novel &#x201C;Similarity-based self-adjusting hash-inspired abstract syntax tree&#x201D; algorithm that determines the similarity level of the codes. The proposed method shows considerable improvement over existing code clone identification methods.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Code cloning</kwd>
<kwd>clone detection</kwd>
<kwd>hash inspired abstract syntax tree</kwd>
<kwd>rotations</kwd>
<kwd>hybrid framework</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>The computer industry has grown significantly over the past years. High-quality software and operating systems have played a major role in driving this growth. Present-day software and operating systems are composed of millions of lines of code (LOC) that work toward a common objective with high efficiency and effectiveness. Software is written in different programming languages such as C, C&#x002B;&#x002B;, Java, and Python, and its life cycle has multiple phases: detailed requirement analysis, design, coding, testing, and finally maintenance. Research [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>] has shown that, of these phases, maintenance is the costliest in terms of the money and man-hours needed to carry it out.</p>
<p>Software maintenance is highly dependent on the practices that were used to build the software. One such practice that programmers use when writing software is code cloning. Code cloning is the process of using similar code fragments repeatedly in an application with few or no modifications. Research points out that 7&#x2013;23&#x0025; of the code in large-scale systems is cloned [<xref ref-type="bibr" rid="ref-3">3</xref>].</p>
<sec id="s1_1"><label>1.1</label><title>Benefits of Code Cloning</title>
<p>Apart from ease of maintenance in the future, code cloning offers several other benefits like improvement in software metrics, low compilation time, less cognitive load, less human error, and fewer code fragments that are forgotten or missed. Code cloning has its roots in changing paradigms of programming languages i.e., higher use of templates in programming [<xref ref-type="bibr" rid="ref-2">2</xref>].</p>
</sec>
<sec id="s1_2"><label>1.2</label><title>Drawbacks of Code Cloning</title>
<p>To begin with, code cloning makes it extremely hard to modify code for maintenance purposes. In a heavily cloned system, a programmer who makes a modification has to carefully repeat it in all the cloned sub-systems. This phenomenon is also known as &#x201C;bug propagation&#x201D; [<xref ref-type="bibr" rid="ref-4">4</xref>].</p>
</sec>
<sec id="s1_3"><label>1.3</label><title>Types of Code Similarity</title>
<p>Designing an effective code clone detection system requires an understanding of principles on which two codes are considered to be similar or clones of each other.</p>
<sec id="s1_3_1"><label>1.3.1</label><title>Syntactic Similarity</title>
<p>Two codes are said to be similar syntax-wise if they are similar textually.</p>
<p>&#x2022; Type 1 Clones:</p>
<p>Also known as &#x201C;exact clones&#x201D;, these are code clones that differ only in white space and/or in the addition or deletion of comments [<xref ref-type="bibr" rid="ref-5">5</xref>].</p>
<p>&#x2022; Type 2 Clones:</p>
<p>Also known as &#x201C;parameterized clones&#x201D;, these are code clones that are slightly modified by changing variable, method, or class names. For example, code fragments such as &#x201C;a&#x003D;b&#x002B;2&#x201D; &#x0026; &#x201C;d&#x003D;e&#x002B;2&#x201D; are Type 2 clones [<xref ref-type="bibr" rid="ref-6">6</xref>].</p>
<p>&#x2022; Type 3 Clones:</p>
<p>Also known as &#x201C;gapped clones&#x201D;, these are code clones that differ at the statement level. Here, code fragments have statements added, modified, and/or deleted in addition to Type 2 differences [<xref ref-type="bibr" rid="ref-7">7</xref>].</p>
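<p>The distinction between Type 1 and Type 2 clones can be illustrated with a minimal sketch based on Python&#x0027;s standard ast module. This is not the method proposed in this paper. The helper names is_type1_clone, is_type2_clone, and normalized_dump and the placeholder &#x201C;ID&#x201D; are hypothetical, and the sketch normalizes variable names only, not method or class names.</p>
<code language="python"><![CDATA[
```python
import ast


class _RenameIdentifiers(ast.NodeTransformer):
    """Replace every variable name with one placeholder (illustrative choice)."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="ID", ctx=node.ctx), node)


def normalized_dump(source):
    """Textual form of the AST after variable names have been normalized."""
    tree = _RenameIdentifiers().visit(ast.parse(source))
    return ast.dump(tree)


def is_type1_clone(a, b):
    # Comments and white space never reach the AST, so equal ASTs
    # mean the fragments differ at most in layout and comments.
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


def is_type2_clone(a, b):
    # After renaming, fragments that differ only in variable names coincide.
    return normalized_dump(a) == normalized_dump(b)
```
]]></code>
<p>Under these assumptions, the fragments &#x201C;a&#x003D;b&#x002B;2&#x201D; and &#x201C;d&#x003D;e&#x002B;2&#x201D; are not Type 1 clones but are reported as Type 2 clones, because they coincide once identifiers are normalized.</p>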
</sec>
<sec id="s1_3_2"><label>1.3.2</label><title>Semantic Similarity</title>
<p>Two codes are said to be similar semantically if they are similar on a functional level while being completely different textually. These are Type 4 clones, and they are the hardest to find. For example, the two Python programs in <?A3B2 "tbl1",5,"anchor"?><xref ref-type="table" rid="table-1">Tab. 1</xref>, one computing the factorial of a number using recursion and the other using a for loop (without recursion), can be considered Type 4 clones.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Example Type-4 Clone</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Factorial using recursion</th>
<th align="left">Factorial using for loop</th>
</tr>
</thead>
<tbody>
<tr><td align="left">def recur_factorial(n):</td><td align="left">num&#x2009;&#x003D;&#x2009;7</td></tr>
<tr><td align="left">&#x2002;&#x2002;&#x2002;if n &#x003D;&#x003D; 1:</td><td align="left">factorial&#x2009;&#x003D;&#x2009;1</td></tr>
<tr><td align="left">&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;return n</td><td align="left">for i in range(1, num &#x002B; 1):</td></tr>
<tr><td align="left">&#x2002;&#x2002;&#x2002;else:</td><td align="left">&#x2002;&#x2002;&#x2002;factorial &#x003D; factorial&#x002A;i</td></tr>
<tr><td align="left">&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;return n&#x002A;recur_factorial(n-1)</td><td/></tr>
<tr><td align="left">num&#x2009;&#x003D;&#x2009;7</td><td/></tr>
<tr><td align="left">recur_factorial(num)</td><td/></tr>
</tbody>
</table>
</table-wrap>
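<p>The Type 4 relationship can be checked by running both variants on the same inputs. The sketch below restates the two programs as functions. The wrapper loop_factorial and the variable same_behavior are hypothetical additions made so that the two versions can be compared directly.</p>
<code language="python"><![CDATA[
```python
def recur_factorial(n):
    """Factorial by recursion, following the recursive variant."""
    if n == 1:
        return n
    else:
        return n * recur_factorial(n - 1)


def loop_factorial(num):
    """Factorial by a for loop; the function wrapper is a hypothetical addition."""
    factorial = 1
    for i in range(1, num + 1):
        factorial = factorial * i
    return factorial


# Same input/output behavior, different text and structure: a Type 4 pair.
same_behavior = all(recur_factorial(n) == loop_factorial(n) for n in range(1, 11))
```
]]></code>
<p>Both functions return 5040 for the input 7, although they share almost no text, which is why purely syntactic detectors tend to miss such pairs.</p>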
</sec>
</sec>
<sec id="s1_4"><label>1.4</label><title>Issues with Existing Work</title>
<p>Present code clone detection techniques have the following limitations:</p>
<sec id="s1_4_1"><label>1.4.1</label><title>Sensitivity to Type 3 &#x0026; 4 Clones</title>
<p>Most of the literature [<xref ref-type="bibr" rid="ref-8">8</xref>&#x2013;<xref ref-type="bibr" rid="ref-10">10</xref>] studied by us focuses on Type 1 &#x0026; 2 clones.</p>
</sec>
<sec id="s1_4_2"><label>1.4.2</label><title>Runtime Complexity</title>
<p>Techniques like CCFinder [<xref ref-type="bibr" rid="ref-11">11</xref>], VUDDY [<xref ref-type="bibr" rid="ref-12">12</xref>], SeClone [<xref ref-type="bibr" rid="ref-13">13</xref>], TwinFinder [<xref ref-type="bibr" rid="ref-14">14</xref>], Deckard [<xref ref-type="bibr" rid="ref-15">15</xref>] have high complexities in terms of usage of memory and processing power.</p>
</sec>
</sec>
</sec>
<sec id="s2"><label>2</label><title>Related Work</title>
<p>This section reviews state-of-the-art detectors for Type 3 &#x0026; 4 code clones, along with work on code clone detection using machine learning.</p>
<p>According to the work of Urak, most code plagiarism detectors are limited in the variety of source codes that they can process [<xref ref-type="bibr" rid="ref-16">16</xref>]. Furthermore, most of the techniques used for semantic code clone detection are unable to provide a heuristic solution for problems such as statement reordering, inversion of control predicates, and insertion of non-useful statements. All of these can cause a bottleneck in the environment. To handle these issues, Tekchandani proposed a novel approach that uses data flow analysis based on liveness analysis &#x0026; reaching definitions to detect semantic clones in a procedure or a program [<xref ref-type="bibr" rid="ref-17">17</xref>].</p>
<p>In [<xref ref-type="bibr" rid="ref-18">18</xref>] Keivanloo et al. suggested the k-means clustering method as a replacement for the threshold-based cutoff phase in the clone identification process. Previous work on clone detection solved the scalability issue. As a result, they suggest a technique to aid practitioners in the use of scalable Type-3 clone detection algorithms across software systems. They are particularly concerned with enhancing performance and usability. As part of the setup, k-means is used to calculate the number of anticipated clusters. The testing results suggest that using the k-means algorithm boosts performance by 12 percent.</p>
<p>Anil et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] described a simple and effective approach for detecting precise and near-miss clones in program source code using AST. The identification of code clones is useful not only for creating more organized code fragments but also for finding domain concepts and their idiomatic implementations.</p>
<p>The authors of [<xref ref-type="bibr" rid="ref-8">8</xref>] presented a novel study of code clone genealogy evolution on OpenMRS, a git-based e-health system. Their model is based on transitive closure computation using the Hadoop ecosystem. A parse tree kernel-based code plagiarism detection method was presented in [<xref ref-type="bibr" rid="ref-20">20</xref>]. The parse tree kernel produces a similarity value between two source codes in terms of parse tree similarity. The system handles structural information successfully because parse trees capture the key syntactic structure of source code. That work makes two important contributions. First, it proposes a parse tree kernel optimized for program source code, and according to the evaluation the system based on this kernel outperforms well-known baseline systems. Second, the authors gathered a large number of real-world Java source codes from a university programming class, and two separate human annotators manually evaluated and labeled this test set to identify plagiarized codes. A framework for detecting both code obfuscation &#x0026; cloning using machine learning was given in [<xref ref-type="bibr" rid="ref-1">1</xref>]. It uses features extracted from Java bytecode dependency graphs, program dependency graphs &#x0026; abstract syntax trees.</p>
<p>The work in [<xref ref-type="bibr" rid="ref-21">21</xref>] focuses on improving the scalability of code clone detection relative to the current state-of-the-art techniques. Its adaptive prefix filtering technique improves the performance of code clone detection for many common execution parameters when tested on common benchmarks. The experimental results show improvements for commonly used similarity thresholds between 40&#x0025; and 80&#x0025;, in the best case decreasing the execution time by up to 11&#x0025; and increasing the number of filtered candidates by up to 63&#x0025;.</p>
<p>DeepCRM, a deep learning-based model for code readability classification, was proposed in [<xref ref-type="bibr" rid="ref-22">22</xref>]. DeepCRM first transforms source codes into integer matrices that serve as the input to ConvNets. It consists of three separate ConvNets with identical structures that are trained on data pre-processed in different ways. DeepCRM shows an improvement of 2.4&#x0025; to 17.2&#x0025; over previous approaches.</p>
<sec id="s2_1"><label>2.1</label><title>State of the Art for Type 3 Clones</title>
<p>LVMapper was developed to detect large-variance clones, i.e., clones with relatively many differences, in large source code repositories. It specifically considers modifications that are scattered across large codes. LVMapper uses seeds (small windows of continuous lines) to locate and filter the candidate pairs of code clones [<xref ref-type="bibr" rid="ref-9">9</xref>]. SourcererCC is a technique based on token-level granularity that uses an index to achieve scalability. SourcererCC has a precision of 86&#x0025; and a recall of 86&#x0025;&#x2013;100&#x0025; on 250 MLOC [<xref ref-type="bibr" rid="ref-23">23</xref>]. CloneWorks has direct application in large-scale clone detection experiments. Its representation of source code for clone detection can be fully customized to the user&#x0027;s needs [<xref ref-type="bibr" rid="ref-24">24</xref>]. NICAD is a lightweight clone detection approach that uses flexible pretty-printing and code normalization techniques. It uses agile parsing to remove noise and island grammars to select potential clones [<xref ref-type="bibr" rid="ref-25">25</xref>]. Deckard characterizes subtrees with numerical vectors and clusters these vectors with an algorithm based on the Euclidean distance [<xref ref-type="bibr" rid="ref-15">15</xref>].</p>
</sec>
<sec id="s2_2"><label>2.2</label><title>State of the Art for Type 4 Clones</title>
<p>Jiang proposed a random number input approach to detect semantic clones. The key idea used by Jiang is to reduce the code by considering all possible consecutive subsequences of a code fragment [<xref ref-type="bibr" rid="ref-26">26</xref>]. Gabel proposed a scalable clone detection technique that reduces the difficult graph similarity problem to a tree similarity problem by carefully mapping program dependence graphs (PDGs) to their related structured syntax [<xref ref-type="bibr" rid="ref-27">27</xref>].</p>
</sec>
<sec id="s2_3"><label>2.3</label><title>Latest Work on Code Clone Detection</title>
<p>TwinFinder is a novel closed-loop clone detection approach that uses symbolic execution and machine learning techniques to get better results. To reduce false positives, TwinFinder uses a feedback loop for formal loops to tune the machine learning algorithm. It lays special focus on false positives and was able to eliminate 99.32&#x0025;, 89&#x0025; &#x0026; 86.74&#x0025; of the false positives in bzip2, thttpd &#x0026; Links, respectively [<xref ref-type="bibr" rid="ref-14">14</xref>].</p>
<p>Oreo is a novel technique specifically designed for Type 4 clones that also exhibit some similarities syntax-wise. This category of clones is said to be in the Twilight zone. Oreo uses machine learning &#x0026; size similarity sharding to perform clone detection [<xref ref-type="bibr" rid="ref-10">10</xref>].</p>
<p>Clonmel proposed a solution to code clone detection problems via learning supervised deep features [<xref ref-type="bibr" rid="ref-28">28</xref>].</p>
</sec>
</sec>
<sec id="s3"><label>3</label><title>SSA-HIAST Framework</title>
<p>As per the literature we studied, most code clone detection techniques comprise two major phases. First, they convert the code fragments into a suitable representation. Second, they apply an appropriate code similarity detection algorithm to detect code clones, as shown in <?A3B2 "fig1",5,"anchor"?><xref ref-type="fig" rid="fig-1">Fig. 1</xref>.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>Overall Process followed by existing techniques</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-1.png"/></fig>
<p>Our work presents a first-of-its-kind hybrid framework, SSA-HIAST (Similarity-based Self-Adjusting Hash-Inspired Abstract Syntax Tree), for detecting Type 1, 2, 3 &#x0026; 4 code clones in the Python programming language.</p>
<p><bold>Code Repository</bold></p>
<p>To implement the framework, we used 153 open-source codes taken from different GitHub repositories. Type 1, 2, 3, and 4 code clones of different lengths and logic were then injected into these 153 codes manually by three Python programmers, who were trained before injecting the clones. The results of the evaluations of each programmer were cross-verified by the other two programmers. To check the accuracy of the detected clones, some of the clones injected by the programmers were selected randomly to check whether they were detected. The entire process took 18 months.</p>
<sec id="s3_1"><label>3.1</label><title>Phase 1. Intermediate Code Representation</title>
<p>We use abstract syntax trees (ASTs) as the basic structure for intermediate code representation. ASTs represent the logical structure of source code and are created from a token stream. <?A3B2 "fig2",5,"anchor"?><xref ref-type="fig" rid="fig-2">Fig. 2</xref> shows a basic AST for a sample Python code. To the best of our knowledge, there have not been any advancements to the core structure of ASTs. Hence, we introduce the Hash-inspired AST (HIAST).</p>
<fig id="fig-2"><label>Figure 2</label><caption><title>AST for a sample Python Code</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-2.png"/></fig>
<p>First, every source code is transformed into a parse tree representation by using the appropriate syntax grammar. The HIAST helps to process trees of the Python abstract syntax grammar. The algorithm helps to find out programmatically what the current grammar looks like. HIAST computes a hash(object) at every stage and stores the computed hash along with the node as further input to the code matching algorithm. It also builds a hash table of the hashes, which is later used to tune the HIAST in terms of height. By using HIAST, the hash of the entire code is also generated, so we only need to keep track of hash values and not the entire code. This in turn helps in better management of both the software and the software clones. Each node of the AST is traversed in preorder to attach a hash value to it. The pseudocode for the generation of HIAST is given below:</p>
<fig id="fig-16"><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-16.png"/></fig>
<p>&#x2002;&#x2002;&#x2002;Sample Python code:</p>
<p>&#x2002;&#x2002;&#x2002;def max(firstNo, secondNo):</p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;if (firstNo &#x003E; secondNo):</p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;return firstNo</p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;else:</p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;return secondNo</p>
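<p>The idea of attaching a hash to every node during a preorder traversal can be sketched with Python&#x0027;s standard ast and hashlib modules. This is an illustrative sketch, not the pseudocode of the framework. The names build_hiast, node_hash, and hash_id are hypothetical, the use of SHA-1 over the textual dump of a subtree is an assumed hashing scheme, and the condition of the sample function is taken to be a greater-than comparison, which is also an assumption.</p>
<code language="python"><![CDATA[
```python
import ast
import hashlib


def node_hash(node):
    # Assumption: a subtree is identified by the SHA-1 of its ast.dump() text.
    return hashlib.sha1(ast.dump(node).encode("utf-8")).hexdigest()


def build_hiast(source):
    """Parse source, annotate each node with a hash, and build a hash table."""
    tree = ast.parse(source)
    hash_table = {}

    def visit(node):
        # Preorder: handle the node itself before any of its children.
        node.hash_id = node_hash(node)
        hash_table.setdefault(node.hash_id, []).append(node)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return tree, hash_table


SAMPLE = """
def max(firstNo, secondNo):
    if firstNo > secondNo:
        return firstNo
    else:
        return secondNo
"""

tree, table = build_hiast(SAMPLE)
```
]]></code>
<p>In this sketch the hash stored on the root node summarizes the whole file, so two files with identical structure can be matched by comparing a single value before any subtree is inspected.</p>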
</sec>
<sec id="s3_2"><label>3.2</label><title>Limitation of AST</title>
<p>For codes expanding to millions of lines, the height of an AST can be high which can lead to higher runtimes for code similarity detection. To achieve faster processing, we have used the concept of indexing every node in AST using a dedicated hash.</p>
</sec>
<sec id="s3_3"><label>3.3</label><title>Benefits of Self Adjust Feature in HIAST</title>
<p>Using HIAST along with rotations inspired by the AVL tree, the height of the tree is kept at a bounded level and does not grow unnecessarily. As a result, less memory is consumed than with a simple AST. The time for comparisons also decreases, because the reduced height of the tree eliminates unnecessary comparisons.</p>
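<p>How a rotation reduces height can be seen in a minimal sketch of the textbook single rotations used by AVL trees. These are the classic binary-tree rotations, shown only for illustration, and they are not the rotation set of the framework. The class Node and the functions height, rotate_left, and rotate_right are hypothetical names.</p>
<code language="python"><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def height(node):
    """Number of nodes on the longest root-to-leaf path."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def rotate_left(root):
    """Promote the right child; the old root becomes its left child."""
    pivot = root.right
    root.right = pivot.left
    pivot.left = root
    return pivot


def rotate_right(root):
    """Promote the left child; the old root becomes its right child."""
    pivot = root.left
    root.left = pivot.right
    pivot.right = root
    return pivot


chain = Node("a", right=Node("b", right=Node("c")))
height_before = height(chain)      # right-leaning chain of three nodes
balanced = rotate_left(chain)
height_after = height(balanced)    # "b" is now the root with two children
```
]]></code>
<p>A right-leaning chain of three nodes has height 3, and a single left rotation brings it down to height 2 without adding or removing any node.</p>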
</sec>
<sec id="s3_4"><label>3.4</label><title>Rules for Generating HIAST</title>
<p>Each node in a statement from a code fragment can be represented as a record in the following way:
<list list-type="bullet">
<list-item><p>Operators: one field for the operator and the remaining fields as pointers to the operands: mknode(operator, leftOperand, rightOperand), as shown in <?A3B2 "fig3",5,"anchor"?><xref ref-type="fig" rid="fig-3">Fig. 3</xref>.</p></list-item>
<list-item><p>Number/String: one field with the label &#x201C;num&#x201D;/&#x201C;str&#x201D; and a pointer to the value: mkleaf(num/str, val), as shown in <?A3B2 "fig4",5,"anchor"?><xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p></list-item>
<list-item><p>Looping construct (for, while): one field for the type of looping construct and remaining field pointers for number assignment, condition &#x0026; operator as in <?A3B2 "fig5",5,"anchor"?><xref ref-type="fig" rid="fig-5">Fig. 5</xref>.</p></list-item>
<list-item><p>Condition (if, else): one field for condition and remaining field pointers to condition variables as shown in <?A3B2 "fig6",5,"anchor"?><xref ref-type="fig" rid="fig-6">Fig. 6</xref>.</p></list-item>
<list-item><p>Data Structure (List, Set, Tuple, Dictionary &#x0026; Array): One field for size (n) and next (n) fields for values.</p></list-item>
<list-item><p>Function call: one field for the function name and remaining pointers for all the arguments, as shown in <?A3B2 "fig7",5,"anchor"?><xref ref-type="fig" rid="fig-7">Fig. 7</xref>.</p></list-item>
<list-item><p>Class/Object: One field containing the object of the class.</p></list-item>
<list-item><p>hashId: one field with label &#x201C;hashId&#x201D; and pointer to store the hash of the node currently being created: mkleaf(hashId, hashval).</p></list-item>
</list>
<fig id="fig-3"><label>Figure 3</label><caption><title>AST representation for operators</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-3.png"/></fig>
<fig id="fig-4"><label>Figure 4</label><caption><title>AST representations for literals</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-4.png"/></fig>
<fig id="fig-5"><label>Figure 5</label><caption><title>AST representation for looping constructs</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-5.png"/></fig>
<fig id="fig-6"><label>Figure 6</label><caption><title>AST representation for conditions</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-6.png"/></fig>
<fig id="fig-7"><label>Figure 7</label><caption><title>AST representation for function calls</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-7.png"/></fig>
For example, the statement &#x201C;expression&#x2009;&#x003D;&#x2009;6&#x2009;&#x002B;&#x2009;8&#x201D; is first converted to tokens, as shown in <?A3B2 "fig8",5,"anchor"?><xref ref-type="fig" rid="fig-8">Fig. 8</xref>, and these tokens are then passed to the HIAST algorithm to produce the tree.</p>
<fig id="fig-8"><label>Figure 8</label><caption><title>Token representation for &#x201C;expression&#x2009;&#x003D;&#x2009;6&#x2009;&#x002B;&#x2009;8&#x201D;</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-8.png"/></fig>
<p>The following sequence of function calls creates the AST for &#x201C;expression&#x2009;&#x003D;&#x2009;6&#x2009;&#x002B;&#x2009;8&#x201D;, as shown in <?A3B2 "fig9",5,"anchor"?><xref ref-type="fig" rid="fig-9">Fig. 9</xref>.</p>
<p>P1&#x003D; mkleaf(num, 6)</p>
<p>P2&#x003D; mkleaf(num, 8)</p>
<p>P3&#x003D; mknode(&#x002B;, P1, P2)</p>
<p>P4&#x003D; mkleaf(store,P3)</p>
<fig id="fig-9"><label>Figure 9</label><caption><title>HIAST representation for &#x201C;expression&#x2009;&#x003D;&#x2009;6&#x2009;&#x002B;&#x2009;8&#x201D;</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-9.png"/></fig>
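<p>The construction of the tree for &#x201C;expression&#x2009;&#x003D;&#x2009;6&#x2009;&#x002B;&#x2009;8&#x201D; from mkleaf and mknode calls can be sketched as follows. Only the names mkleaf, mknode, and hashId come from the rules above. The record type HNode, the helper _digest, and the choice of SHA-1 over the label, the value, and the child hashes are assumptions made for illustration.</p>
<code language="python"><![CDATA[
```python
import hashlib
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class HNode:
    label: str
    value: Any = None
    children: Tuple["HNode", ...] = ()
    hash_id: str = ""


def _digest(label, value, children):
    # Assumed scheme: hash the label, the value, and the hashes of the children.
    value_part = value.hash_id if isinstance(value, HNode) else repr(value)
    payload = "|".join([label, value_part, *(c.hash_id for c in children)])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def mkleaf(label, value):
    """Leaf record: a label, a value, and the hash of the new node."""
    return HNode(label, value, (), _digest(label, value, ()))


def mknode(operator, left, right):
    """Operator record: the operator and pointers to both operands."""
    children = (left, right)
    return HNode(operator, None, children, _digest(operator, None, children))


p1 = mkleaf("num", 6)
p2 = mkleaf("num", 8)
p3 = mknode("+", p1, p2)
p4 = mkleaf("store", p3)
```
]]></code>
<p>Because each hash in this sketch depends on the hashes of the children, rebuilding the same expression yields the same hash at the root, while swapping the operands yields a different one.</p>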
<p>The pseudocode for the generation of AST is given below. It scans the code and inserts a node in the tree depending on the type of token.
</p>
<fig id="fig-17"><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-17.png"/></fig>
</sec>
<sec id="s3_5"><label>3.5</label><title>Phase 2: Code Similarity Detection</title>
<p>Once the codes to be checked are in a suitable representation, we apply an effective (high recall &#x0026; precision) code similarity detection algorithm. To the best of our knowledge, we are the first to use a &#x201C;self-adjusting&#x201D; feature in ASTs driven by a similarity score. Large codes can generate ASTs that consume a lot of memory because of the high depth of the generated tree.</p>
<p>Large code systems have the same code clones used in multiple parts of the code. With this as motivation, we decided to restructure a code by adjusting similar code fragments up or down the order in the original file. Most previous work applied similarity detection to code fragments individually. Here we introduce a novel technique that applies similarity detection at the level of a code file instead of a code fragment. A few techniques for code file similarity already exist, but the proposed work focuses on large code systems with an efficient similarity detection approach.</p>
<sec id="s3_5_1"><label>3.5.1</label><title>Similarity Metric</title>
<p>Given two code files c1 &#x0026; c2, the similarity between the two code fragments representing a subtree of AST is defined in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">m</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">u</mml:mi><mml:mi mathvariant="normal">b</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mspace width="thickmathspace" /><mml:mi mathvariant="normal">c</mml:mi></mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mspace width="thickmathspace" /><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">u</mml:mi><mml:mi mathvariant="normal">b</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mspace width="thickmathspace" /><mml:mi mathvariant="normal">c</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">N</mml:mi></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">N</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="normal">L</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="normal">R</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where &#x201C;SN&#x201D; is the number of nodes shared by subtrees T1 &#x0026; T2, and L &#x003D; [t1, t2, &#x2026;, tn] and R &#x003D; [t1, t2, &#x2026;, tn] are the lists of nodes that appear only in T1 and only in T2, respectively; in the formula, L and R stand for the sizes of these lists.</p>
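<p>The similarity score of Eq. (1) can be sketched in a few lines, reading L and R as the numbers of nodes found only in the first and only in the second subtree. For instance, with SN&#x2009;&#x003D;&#x2009;8, L&#x2009;&#x003D;&#x2009;2, and R&#x2009;&#x003D;&#x2009;2 the score is 16/20&#x2009;&#x003D;&#x2009;0.8. In the sketch below, nodes are matched by their AST node type only, which is a simplifying assumption, and the function names node_bag and similarity are hypothetical.</p>
<code language="python"><![CDATA[
```python
import ast
from collections import Counter


def node_bag(source):
    """Multiset of AST node type names (a simplified notion of 'node')."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))


def similarity(src1, src2):
    """Sim = 2*SN / (2*SN + L + R) over the two multisets of nodes."""
    bag1, bag2 = node_bag(src1), node_bag(src2)
    shared = sum((bag1 & bag2).values())       # SN: nodes present in both
    left_only = sum((bag1 - bag2).values())    # L: nodes only in the first
    right_only = sum((bag2 - bag1).values())   # R: nodes only in the second
    denominator = 2 * shared + left_only + right_only
    return 2 * shared / denominator if denominator else 1.0
```
]]></code>
<p>Under this simplification, two fragments that differ only in identifier names score 1.0, and the score drops toward 0 as more nodes appear in only one of the fragments.</p>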
</sec>
<sec id="s3_5_2"><label>3.5.2</label><title>Syntactic Similarity</title>
<p>A similarity detection algorithm requires a threshold function in addition to a similarity metric. The threshold function is used to decide the optimum level for the similarity check, as shown in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mi mathvariant="normal">T</mml:mi><mml:mi mathvariant="normal">h</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow><mml:mo stretchy="false">&#x221A;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">m</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">C</mml:mi><mml:mi mathvariant="normal">i</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">m</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">C</mml:mi><mml:mi mathvariant="normal">j</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mspace width="thickmathspace" /><mml:mo>&#x2217;</mml:mo><mml:mspace width="thickmathspace" /></mml:mrow><mml:mn>2</mml:mn><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>
Our self-tree readjusting algorithm performs the rotations on the HIAST that are needed to compare two subtrees for clones effectively. Rotations are performed based on the value of the threshold (Th), and an upper limit is set on the number of rotations to prevent resource exhaustion. Using the &#x201C;label&#x201D; fields in the node structure, two subtrees under comparison can be brought to the same level by performing certain rotations.</p>
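<p>As a minimal sketch, <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> can be evaluated in Python as shown below. The grouping of terms follows the equation exactly as printed, i.e., the minimum is taken over Sim(Ci) and Sim(Cj) &#x2217; 2(1 &#x2212; Sim). The function name <monospace>threshold</monospace> and its parameter names are illustrative assumptions and are not part of the framework.</p>
<preformat preformat-type="code">
```python
import math


def threshold(sim_ci: float, sim_cj: float, sim: float) -> float:
    """Evaluate Th from Eq. (2), read literally as printed.

    sim_ci, sim_cj: the quantities written Sim(Ci) and Sim(Cj) in Eq. (2).
    sim: the required code similarity Sim, expected to lie in [0, 1].
    (Names are hypothetical; only the formula comes from the text.)
    """
    if sim > 1.0 or 0.0 > sim:
        raise ValueError("sim must lie in [0, 1]")
    if 0.0 > sim_ci or 0.0 > sim_cj:
        raise ValueError("Sim(Ci) and Sim(Cj) must be non-negative")
    # Th = sqrt(min(Sim(Ci), Sim(Cj) * 2 * (1 - Sim)))
    return math.sqrt(min(sim_ci, sim_cj * 2.0 * (1.0 - sim)))
```
</preformat>
<p>For instance, with the hypothetical values Sim(Ci) = 50, Sim(Cj) = 60, and Sim = 0.9, the call <monospace>threshold(50, 60, 0.9)</monospace> evaluates &#x221A;(min(50, 12)) = &#x221A;12, which is approximately 3.46. A required similarity of Sim = 1 yields Th = 0, so only identical vectors would be grouped.</p>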
<p>We have used Latent Semantic Indexing [<xref ref-type="bibr" rid="ref-29">29</xref>], based on the Euclidean distance between two vectors, to cluster a vector group given a set of characteristic vectors. Assume that two feature vectors FVeci and FVecj represent two code snippets CSi and CSj, respectively. Size(CSi) and Size(CSj) denote the code sizes (the total number of AST nodes), and E([FVeci; FVecj]) is the Euclidean distance between FVeci and FVecj. Given a feature vector group VG, the threshold may be reduced to &#x221A;(min(Sim(Ci), Sim(Cj) &#x2217; 2(1 &#x2212; Sim))), where vector sizes are used to estimate tree sizes. Sim is the code similarity measure given by <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>. Thus, if E([FVeci; FVecj]) &#x2264; Th, code fragments CSi and CSj are grouped as code clones under a given code similarity Sim. Four types of rotations are used in the framework. The pseudocode for the rotation set is given below.</p>
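<p>The grouping test described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the framework&#x2019;s implementation: the function names are hypothetical, the feature vectors are plain lists of numbers, and the threshold Th is passed in as a precomputed value.</p>
<preformat preformat-type="code">
```python
import math
from typing import Sequence


def euclidean_distance(fvec_i: Sequence[float], fvec_j: Sequence[float]) -> float:
    """Return E([FVeci; FVecj]), the Euclidean distance between two vectors."""
    if len(fvec_i) != len(fvec_j):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fvec_i, fvec_j)))


def is_clone_pair(fvec_i: Sequence[float], fvec_j: Sequence[float], th: float) -> bool:
    """Group CSi and CSj as clones when E([FVeci; FVecj]) does not exceed Th."""
    return th >= euclidean_distance(fvec_i, fvec_j)
```
</preformat>
<p>With the made-up vectors [3, 1, 2] and [3, 2, 2], the distance is 1, so the pair is grouped for Th = 1.5 and rejected for Th = 0.5.</p>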
<fig id="fig-18"><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-18.png"/></fig>
<p>The pseudocode for the similarity algorithm is given below. First, the code files are scanned and tokenized. The HIAST of each of the two code files is then generated, and the two trees are compared based on the similarity metric and the threshold.</p>
<fig id="fig-19"><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-19.png"/></fig>
<p>An example of two code fragments for comparison is given in <?A3B2 "tbl2",5,"anchor"?><xref ref-type="table" rid="table-2">Tab. 2</xref>. The similarity of code 1 and code 2 is explained with the help of rotations in <?A3B2 "fig10",5,"anchor"?><xref ref-type="fig" rid="fig-10">Fig. 10</xref>.</p>
<table-wrap id="table-2"><label>Table 2</label><caption><title>Example Type-4 Clone</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Code 1: a - 4 &#x002B; c</th>
<th align="left">Code 2: a &#x002B; c - 4</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">P1&#x003D; mkleaf(id, a)<break/>P2&#x003D; mkleaf(num, 4)<break/>P3&#x003D; mknode(-,P1,P2)<break/>P4&#x003D; mkleaf(id, c)<break/>P5&#x003D; mknode(&#x002B;,P3,P4)</td>
<td align="left">P1&#x003D; mkleaf(id, a)<break/>P2&#x003D; mkleaf(id, c)<break/>P3&#x003D; mknode(&#x002B;,P1,P2)<break/>P4&#x003D; mkleaf(num, 4)<break/>P5&#x003D; mknode(-,P3,P4)</td>
</tr>
</tbody>
</table>
</table-wrap>
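<p>The two construction sequences in <xref ref-type="table" rid="table-2">Tab. 2</xref> can be replayed with a small Python sketch. Here <monospace>mkleaf</monospace> and <monospace>mknode</monospace> are modelled as hypothetical helpers that return tuples. The helper <monospace>signed_terms</monospace> is a simple stand-in that flattens a tree of &#x002B; and &#x2212; operators into signed operands. It is used only to show that the two differently shaped trees denote the same expression, and it is not the HIAST rotation procedure.</p>
<preformat preformat-type="code">
```python
from collections import Counter


def mkleaf(kind, value):
    """Create a leaf node, e.g. mkleaf("id", "a") or mkleaf("num", 4)."""
    return (kind, value)


def mknode(op, left, right):
    """Create an interior node for the binary operator op ("+" or "-")."""
    return (op, left, right)


def signed_terms(node, sign=1):
    """Flatten a +/- tree into a multiset of (sign, leaf) pairs."""
    if len(node) == 2:  # leaf: (kind, value)
        return Counter({(sign, node): 1})
    op, left, right = node
    # The right operand of "-" is negated; "+" keeps the current sign.
    right_sign = sign if op == "+" else -sign
    return signed_terms(left, sign) + signed_terms(right, right_sign)


# Code 1: a - 4 + c
p1 = mkleaf("id", "a")
p2 = mkleaf("num", 4)
p3 = mknode("-", p1, p2)
p4 = mkleaf("id", "c")
code1 = mknode("+", p3, p4)

# Code 2: a + c - 4
q1 = mkleaf("id", "a")
q2 = mkleaf("id", "c")
q3 = mknode("+", q1, q2)
q4 = mkleaf("num", 4)
code2 = mknode("-", q3, q4)

# The trees differ in shape but contain the same signed operands.
same_shape = code1 == code2
same_expression = signed_terms(code1) == signed_terms(code2)
```
</preformat>
<p>In this sketch, <monospace>same_shape</monospace> is false while <monospace>same_expression</monospace> is true, which is the situation that a purely structural comparison would miss.</p>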
<fig id="fig-10"><label>Figure 10</label><caption><title>Rotation being performed in a HIAST</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-10.png"/></fig>
</sec>
</sec>
</sec>
<sec id="s4"><label>4</label><title>Experimentation and Results</title>
<p>For experimentation, we used 153 Python programs from different publicly available repositories on GitHub. This section presents a detailed analysis of the results of the SSA-HIAST approach. Similarity detection requires a similarity threshold against which the similarity metric Sim is compared: if the value of Sim is greater than the similarity threshold, the pair is reported as a match; otherwise it is not. In our framework, two code fragments are considered clones if they are 90&#x0025; similar. In addition, a rotation threshold is required to prevent the algorithm from entering an infinite loop. Our self-tree readjusting algorithm performs the rotations on the HIAST that are needed to compare two subtrees for clones; using the &#x201C;label&#x201D; fields in the node structure, two subtrees under comparison can be brought to the same level by performing certain rotations. Rotations are performed based on the value of the threshold (Th), and an upper limit is set on the number of rotations to prevent resource exhaustion.</p>
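<p>The interplay of the similarity threshold and the rotation limit can be sketched in Python as follows. This is a hedged illustration only: <monospace>similarity</monospace> and <monospace>rotate</monospace> are caller-supplied placeholders rather than the framework&#x2019;s Sim metric or its rotation set, the default <monospace>max_rotations</monospace> of 8 is an arbitrary assumed value, and treating a similarity equal to the threshold as a match is likewise an assumption.</p>
<preformat preformat-type="code">
```python
from typing import Callable, TypeVar

T = TypeVar("T")


def match_with_rotations(
    tree_a: T,
    tree_b: T,
    similarity: Callable[[T, T], float],
    rotate: Callable[[T], T],
    sim_threshold: float = 0.90,
    max_rotations: int = 8,  # hypothetical cap; the text gives no value
) -> bool:
    """Report a match if the similarity reaches sim_threshold
    after at most max_rotations rotations of tree_a."""
    candidate = tree_a
    # One initial comparison plus a bounded number of rotations,
    # so the loop always terminates.
    for _ in range(max_rotations + 1):
        if similarity(candidate, tree_b) >= sim_threshold:
            return True
        candidate = rotate(candidate)
    return False
```
</preformat>
<p>Because the loop is bounded by <monospace>max_rotations</monospace>, a pair that never reaches the similarity threshold is rejected after a fixed amount of work instead of being rotated indefinitely.</p>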
<sec id="s4_1"><label>4.1</label><title>Experimental Setup</title>
<p>In all 153 Python programs, three different Python programmers injected code clone fragments with modified variable, function, and class names. Furthermore, for, while, and if statements were replaced with equivalent constructs, and useless lines (600&#x2013;2000) were added. Clone detection was run on an Intel i5 (2.7 GHz) machine with 16 GB of RAM running Ubuntu 18.04 LTS.</p>
</sec>
<sec id="s4_2"><label>4.2</label><title>Evaluation Criteria</title>
<p>The framework is evaluated on the basis of various parameters discussed below:
<list list-type="bullet">
<list-item><p><bold>Clone Quantity</bold>: Number of detected clones</p></list-item>
<list-item><p><bold>Clone Quality</bold>: Number of false positives</p></list-item>
<list-item><p><bold>Precision</bold>: The ratio of true positives to all reported positives. In our case, it is the fraction of the reported clones that are correctly identified.
Precision&#x003D;True Positive/(True Positive &#x002B; False Positive)</p></list-item>
<list-item><p><bold>Recall</bold>: A measure of how well the model detects true positives. In our case, it is the fraction of all the clones present that are correctly identified.
Recall&#x003D;True Positive/(True Positive &#x002B; False Negative)</p></list-item>
</list></p>
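<p>The two formulas above translate directly into Python. The sketch below is illustrative: the function names are assumptions, and returning 0.0 when a denominator is zero is a convention chosen here, not something specified by the framework.</p>
<preformat preformat-type="code">
```python
def precision(true_positive: int, false_positive: int) -> float:
    """Precision = TP / (TP + FP): share of reported clones that are real."""
    reported = true_positive + false_positive
    return true_positive / reported if reported else 0.0


def recall(true_positive: int, false_negative: int) -> float:
    """Recall = TP / (TP + FN): share of real clones that were reported."""
    present = true_positive + false_negative
    return true_positive / present if present else 0.0
```
</preformat>
<p>With made-up counts of 90 true positives, 10 false positives, and 30 false negatives, precision is 90/100 = 0.9 and recall is 90/120 = 0.75. These counts are for illustration only and are not results of the experiments.</p>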
<p>Based on the evaluation criteria, the results are depicted in <?A3B2 "fig11",5,"anchor"?><?A3B2 "fig12",5,"anchor"?><?A3B2 "fig13",5,"anchor"?><xref ref-type="fig" rid="fig-11">Figs. 11</xref>&#x2013;<xref ref-type="fig" rid="fig-14">14</xref>, which compare the injected clones with the detected clones in a program for Type-1, Type-2, Type-3, and Type-4 clones, respectively, and thereby indicate the accuracy of the model in detecting clones. The results for Type-1 clones are shown in <xref ref-type="fig" rid="fig-11">Fig. 11</xref>, where there is only a slight difference between the peaks of injected and detected clones. The framework detects most Type-1 clones correctly, with an accuracy of 97.23&#x0025;. Similarly, the results for Type-2 clone detection are summarized in <xref ref-type="fig" rid="fig-12">Fig. 12</xref>. The accuracy achieved for Type-2 clone detection is 96.74&#x0025;, slightly lower than that for Type-1 clone detection.</p>
<p>Type-3 clones, also known as near-miss clones, are more difficult to identify than Type-1 and Type-2 clones. The result analysis for Type-3 clone detection is shown in <xref ref-type="fig" rid="fig-13">Fig. 13</xref>. The accuracy achieved, 95.52&#x0025;, is lower than that for Type-1 and Type-2 clones. Type-4 clones are the most difficult to detect and manage, as they are based on the semantic similarity of code. The accuracy achieved by the proposed framework for Type-4 clone detection is 92.80&#x0025;. The difference between the peaks of injected and detected clones is also the largest for Type-4 clone detection, as is evident from <xref ref-type="fig" rid="fig-14">Fig. 14</xref>.</p>
<fig id="fig-11"><label>Figure 11</label><caption><title>Comparison of Injected v/s Detected Type 1 clones</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-11.png"/></fig>
<fig id="fig-12"><label>Figure 12</label><caption><title>Comparison of Injected v/s Detected Type 2 clones</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-12.png"/></fig>
<p>Moreover, the proposed algorithm is compared with the classic algorithms in <?A3B2 "tbl3",5,"anchor"?><xref ref-type="table" rid="table-3">Tab. 3</xref> in terms of space and time complexity. It can be seen from the table that the space complexity of CP-Miner is directly proportional to the number of lines of code, whereas the space complexity of the proposed SSA-HIAST algorithm is directly dependent on the number of nodes of the tree. Hence the proposed algorithm is better than CP-Miner in terms of space utilization. Also, CP-Miner has quadratic run time complexity, whereas SSA-HIAST has linear run time complexity.</p>
<fig id="fig-13"><label>Figure 13</label><caption><title>Comparison of Injected v/s Detected Type 3 clones</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-13.png"/></fig>
<fig id="fig-14"><label>Figure 14</label><caption><title>Comparison of Injected v/s Detected Type 4 clones</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-14.png"/></fig>
<p>The precision and recall values for the different clone types are given in <?A3B2 "fig15",5,"anchor"?><xref ref-type="fig" rid="fig-15">Fig. 15</xref>. For Type-1 clone detection, the precision and recall are 97.52&#x0025; and 94.93&#x0025;, respectively; for Type-2, they are 96&#x0025; and 92.8&#x0025;, both lower than for Type-1. For Type-3 clone detection, the proposed framework achieves a precision and recall of 95.9&#x0025; and 91.2&#x0025;, respectively. The lowest precision and recall values, 94.5&#x0025; and 87.6&#x0025;, respectively, are obtained for Type-4 clone detection.</p>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Comparative Analysis</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Algorithm</th>
<th align="left">Space complexity</th>
<th align="left">Run time complexity</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">CP-Miner [<xref ref-type="bibr" rid="ref-30">30</xref>]</td>
<td align="left">O(r)</td>
<td align="left">O(r<sup>2</sup>)</td>
</tr>
<tr>
<td align="left">CloneDR [<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td align="left">O(s)</td>
<td align="left">O(r<sup>2</sup>/&#x007C;Buckets&#x007C;)</td>
</tr>
<tr>
<td align="left">LSH [<xref ref-type="bibr" rid="ref-32">32</xref>]</td>
<td align="left">O(s<sup>&#x03C1;&#x002B;1</sup> &#x002B; ks)</td>
<td align="left">O(ks<sup>&#x03C1;</sup> log s)</td>
</tr>
<tr>
<td align="left">LSH w/grouping [<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td align="left">O(max<sub>v&#x2208;G</sub> &#x007C;v&#x007C;<sup>&#x03C1;&#x002B;1</sup> &#x002B; k&#x007C;v&#x007C;)</td>
<td align="left">O(k&#x2211;<sub>v&#x2208;G</sub> &#x007C;v&#x007C;<sup>&#x03C1;</sup> log&#x007C;v&#x007C;)</td>
</tr>
<tr>
<td align="left">DECKARD w/Post-Processing [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td align="left">max&#x007B;O(c&#x007C;rcAN&#x007C;), O<sub>v&#x2208;G</sub>(&#x007C;v&#x007C;<sup>&#x03C1;&#x002B;1</sup> &#x002B; k&#x007C;v&#x007C;)&#x007D;</td>
<td align="left">O(s &#x002B; k&#x2211;<sub>v&#x2208;G</sub> &#x007C;v&#x007C;<sup>&#x03C1;&#x002B;1</sup> log&#x007C;v&#x007C; &#x002B; c&#x007C;rcAN&#x007C;<sup>2</sup>)</td>
</tr>
<tr>
<td align="left">DP-matching [<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td align="left">O(max<sub>v&#x2208;G</sub> &#x007C;v&#x007C;<sup>&#x03C1;&#x002B;1</sup> &#x002B; k&#x007C;v&#x007C;)</td>
<td align="left">O(k&#x2211;<sub>v&#x2208;G</sub> &#x007C;v&#x007C;<sup>&#x03C1;</sup> log&#x007C;v&#x007C;)</td>
</tr>
<tr>
<td align="left">Event checking [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td align="left">O(s<sup>&#x03C1;&#x002B;1</sup> &#x002B; ks)</td>
<td align="left">O(ks<sup>&#x03C1;</sup> log s)</td>
</tr>
<tr>
<td align="left">Normalisation pipeline [<xref ref-type="bibr" rid="ref-36">36</xref>]</td>
<td align="left">O(s<sup>&#x03C1;&#x002B;1</sup> &#x002B; ks)</td>
<td align="left">O(ks<sup>&#x03C1;</sup> log s)</td>
</tr>
<tr>
<td align="left">Context-sensitive pointer analysis [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td align="left">s O(n&#x03B1;(n, n))</td>
<td align="left">O(n) s</td>
</tr>
<tr>
<td align="left">SourcererCC [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td align="left">O(n<sup>2</sup>)</td>
<td align="left">O(n<sup>2</sup>)</td>
</tr>
<tr>
<td align="left">Autoencode [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td align="left">O(n<sup>2</sup>)</td>
<td align="left">O(n<sup>2</sup>)</td>
</tr>
<tr>
<td align="left">SSA-HIAST</td>
<td align="left">O(s)</td>
<td align="left">O(s &#x002B; log&#x007C;Buckets&#x007C;)</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="fig-15"><label>Figure 15</label><caption><title>Precision <italic>vs.</italic> Recall in Detecting Different Clone Types</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_22659-fig-15.png"/></fig>
<p>Worst-case complexities of CloneDR, CP-Miner, and DECKARD (<italic>r</italic> is the number of lines of code, <italic>s</italic> is the size of a parse tree, <italic>&#x007C;Buckets&#x007C;</italic> is the number of hash tables used in CloneDR, <italic>k</italic> is the number of node kinds, <italic>&#x007C;v&#x007C;</italic> is the size of a vector group, 0 <italic>&#x003C; &#x03C1; &#x003C;</italic> 1, <italic>c</italic> is the number of clone classes reported, and <italic>&#x007C;rcAN&#x007C;</italic> is the average size of the clone classes).</p>
</sec>
<sec id="s4_3"><label>4.3</label><title>Benchmarking Against the State of the Art</title>
<p>SSA-HIAST is benchmarked against the state of the art, and the results are shown in <?A3B2 "tbl4",5,"anchor"?><xref ref-type="table" rid="table-4">Tab. 4</xref>. In addition, <?A3B2 "tbl5",5,"anchor"?><xref ref-type="table" rid="table-5">Tab. 5</xref> compares the performance metrics of the proposed algorithm with those of some pre-existing models; the proposed model matches or exceeds all of them on both precision and recall.</p>
<table-wrap id="table-4"><label>Table 4</label><caption><title>Benchmarking SSA-HIAST</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Benchmark</th>
<th align="left">Deckard<break/>clone detection (&#x0025;)</th>
<th align="left">Twin-Finder<break/>clone detection (&#x0025;)</th>
<th align="left">SSA-HIAST<break/>clone detection (&#x0025;)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">bzip2</td>
<td align="left">32.5</td>
<td align="left">62.15</td>
<td align="left">68.75</td>
</tr>
<tr>
<td align="left">sphinx3</td>
<td align="left">32.28</td>
<td align="left">75.89</td>
<td align="left">77.85</td>
</tr>
<tr>
<td align="left">hmmer</td>
<td align="left">27.19</td>
<td align="left">59.55</td>
<td align="left">62.24</td>
</tr>
<tr>
<td align="left">thttpd</td>
<td align="left">29.13</td>
<td align="left">51.64</td>
<td align="left">64.78</td>
</tr>
<tr>
<td align="left">Gzip</td>
<td align="left">9.57</td>
<td align="left">40.15</td>
<td align="left">39.84</td>
</tr>
<tr>
<td align="left">Man</td>
<td align="left">14.74</td>
<td align="left">49.08</td>
<td align="left">55.47</td>
</tr>
<tr>
<td align="left">Links</td>
<td align="left">22.69</td>
<td align="left">64.71</td>
<td align="left">64.5</td>
</tr>
</tbody>
</table>
</table-wrap>
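<p>The per-benchmark comparison in <xref ref-type="table" rid="table-4">Tab. 4</xref> can be tallied with a short Python sketch. The detection rates are copied from the table; the variable and function names are illustrative assumptions.</p>
<preformat preformat-type="code">
```python
# Clone detection rates (%) from Tab. 4, in the order
# (Deckard, Twin-Finder, SSA-HIAST).
TOOLS = ("Deckard", "Twin-Finder", "SSA-HIAST")
TABLE_4 = {
    "bzip2": (32.5, 62.15, 68.75),
    "sphinx3": (32.28, 75.89, 77.85),
    "hmmer": (27.19, 59.55, 62.24),
    "thttpd": (29.13, 51.64, 64.78),
    "gzip": (9.57, 40.15, 39.84),
    "man": (14.74, 49.08, 55.47),
    "links": (22.69, 64.71, 64.5),
}


def best_tool(rates):
    """Return the name of the tool with the highest detection rate."""
    return TOOLS[rates.index(max(rates))]


# Count on how many benchmarks each tool has the highest rate.
wins = {tool: 0 for tool in TOOLS}
for rates in TABLE_4.values():
    wins[best_tool(rates)] += 1
```
</preformat>
<p>On these figures, SSA-HIAST has the highest detection rate on five of the seven benchmarks, while Twin-Finder is marginally ahead on the remaining two (gzip and links).</p>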
<table-wrap id="table-5"><label>Table 5</label><caption><title>Comparison of the proposed model with pre-existing models in terms of performance metrics</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Study</th>
<th align="left">KLOC</th>
<th align="left">Precision (&#x0025;)</th>
<th align="left">Recall (&#x0025;)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Dup [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td align="left">27</td>
<td align="left">80</td>
<td align="left">80</td>
</tr>
<tr>
<td align="left">CCFinder [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td align="left">21</td>
<td align="left">99</td>
<td align="left">93</td>
</tr>
<tr>
<td align="left">Duploc [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td align="left">23</td>
<td align="left">90</td>
<td align="left">86</td>
</tr>
<tr>
<td align="left">DP matching [<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td align="left">28</td>
<td align="left">87</td>
<td align="left">83</td>
</tr>
<tr>
<td align="left">SourcererCC [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td align="left">26</td>
<td align="left">82</td>
<td align="left">79</td>
</tr>
<tr>
<td align="left">Autoencode [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td align="left">30</td>
<td align="left">81</td>
<td align="left">76</td>
</tr>
<tr>
<td align="left">Proposed algorithm</td>
<td align="left">35</td>
<td align="left">99</td>
<td align="left">95</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Conclusion and Future Scope</title>
<p>The proposed system, SSA-HIAST, has achieved higher clone detection rates than the popular and established Deckard and Twin-Finder clone detection techniques on most of the benchmark projects considered: it surpasses Deckard on all seven projects and Twin-Finder on five of the seven. Moreover, the proposed algorithm has been tested on 153 Python programs taken from public GitHub repositories. The results have been evaluated in terms of the number of false positives as well as the number of clones detected. The proposed algorithm can detect Type-4 clones with an accuracy of 92.8&#x0025;. The space complexity of the proposed algorithm is O(s), where s is the number of nodes of the HIAST, and its runtime complexity is O(r &#x002B; s log(buckets)). The proposed framework outperforms other works such as Dup and Duploc in terms of precision and recall, and CCFinder in terms of recall. In the future, we plan to conduct the same experiment using a hybrid deep learning approach that combines two or more techniques for code clone detection and management.</p>
<p>We also plan to extend this work to other programming languages such as Java, R, and C, since the framework currently supports only Python. After detecting and prioritizing true clones, the framework can be further strengthened by employing clone eradication strategies.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="other"><p><bold>Funding Statement:</bold> The authors received no specific funding for this study.</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p></fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Sheneamer</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Roy</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Kalita</surname></string-name></person-group>, &#x201C;<article-title>A detection framework for semantic code clones and obfuscated code</article-title>,&#x201D; <source>Expert Systems with Applications</source>, vol. <volume>97</volume>, pp. <fpage>405</fpage>&#x2013;<lpage>420</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Rattan</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Bhatia</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Singh</surname></string-name></person-group>, &#x201C;<article-title>Software clone detection: A systematic review</article-title>,&#x201D; <source>Information and Software Technology</source>, vol. <volume>55</volume>, no. <issue>7</issue>, pp. <fpage>1165</fpage>&#x2013;<lpage>1199</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Akram</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Shi</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Mumtaz</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Luo</surname></string-name></person-group>, &#x201C;<article-title>DCCD: An efficient and scalable distributed code clone detection technique for Big code</article-title>,&#x201D; in <conf-name>30th Int. Conf. on Software Engineering and Knowledge Engineering</conf-name>, <conf-loc>Redwood City, California, USA</conf-loc>, pp. <fpage>354</fpage>&#x2013;<lpage>360</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Kodhai</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Kanmani</surname></string-name></person-group>, &#x201C;<article-title>Method-level code clone detection through LWH (Light weight hybrid) approach</article-title>,&#x201D; <source>Journal of Software Engineering Research and Development</source>, vol. <volume>2</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>29</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Q. U.</given-names> <surname>Ain</surname></string-name>, <string-name><given-names>F. A. W.</given-names> <surname>Haider</surname></string-name>, <string-name><given-names>M. W.</given-names> <surname>Butt</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Anwar</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Maqbool</surname></string-name></person-group>, &#x201C;<article-title>A systematic review on code clone detection</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>86121</fpage>&#x2013;<lpage>86144</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Cuomo</surname></string-name>, <string-name><given-names>U.</given-names> <surname>Villano</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Santone</surname></string-name></person-group>, &#x201C;<article-title>A novel approach based on formal methods for clone detection</article-title>,&#x201D; in <conf-name>Int. Workshop on Software Clones (IWSC)</conf-name>, <conf-loc>Zurich, Switzerland</conf-loc>, pp. <fpage>8</fpage>&#x2013;<lpage>14</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>White</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Tufano</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Vendome</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Poshyvanyk</surname></string-name></person-group>, &#x201C;<article-title>Deep learning code fragments for code clone detection</article-title>,&#x201D; in <conf-name>31st IEEE/ACM Int. Conf. on Automated Software Engineering (ASE)</conf-name>, <conf-loc>Singapore</conf-loc>, pp. <fpage>87</fpage>&#x2013;<lpage>98</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Tekchandani</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Bhatia</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Singh</surname></string-name></person-group>, &#x201C;<article-title>Code clone genealogy detection on e-health system using hadoop</article-title>,&#x201D; <source>Computers and Electrical Engineering</source>, vol. <volume>61</volume>, pp. <fpage>15</fpage>&#x2013;<lpage>30</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Yin</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Xu</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>LVMapper: A large-variance clone detector using sequencing alignment approach</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>8</volume>, pp. <fpage>27986</fpage>&#x2013;<lpage>27997</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Saini</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Farmahinifarahani</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Baldi</surname></string-name> and <string-name><given-names>C. V.</given-names> <surname>Lopes</surname></string-name></person-group>, &#x201C;<article-title>Oreo: Detection of clones in the twilight zone</article-title>,&#x201D; in <conf-name>26th ACM Joint Meeting on European Software Engineering Conf. and Symposium on the Foundations of Software Engineering</conf-name>, <conf-loc>New York, United States</conf-loc>, pp. <fpage>354</fpage>&#x2013;<lpage>365</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Kamiya</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Kusumoto</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Inoue</surname></string-name></person-group>, &#x201C;<article-title>CCFinder: A multilinguistic token-based code clone detection system for large scale source code</article-title>,&#x201D; <source>IEEE Transactions on Software Engineering</source>, vol. <volume>28</volume>, no. <issue>7</issue>, pp. <fpage>654</fpage>&#x2013;<lpage>670</lpage>, <year>2002</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Kim</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Woo</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Lee</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Oh</surname></string-name></person-group>, &#x201C;<article-title>VUDDY: A scalable approach for vulnerable code clone discovery</article-title>,&#x201D; in <conf-name>IEEE Symposium on Security and Privacy</conf-name>, <conf-loc>San Francisco, CA, USA</conf-loc>, pp. <fpage>595</fpage>&#x2013;<lpage>614</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Keivanloo</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Rilling</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Charland</surname></string-name></person-group>, &#x201C;<article-title>Seclone - A hybrid approach to internet-scale real-time code clone search</article-title>,&#x201D; in <conf-name>19th IEEE Int. Conf. on Program Comprehension</conf-name>, <conf-loc>Kingston, Ontario, Canada</conf-loc>, pp. <fpage>223</fpage>&#x2013;<lpage>224</lpage>, <year>2011</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Xue</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Venkataramani</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Lan</surname></string-name></person-group>, &#x201C;<article-title>Twin-finder: Integrated reasoning engine for pointer-related code clone detection</article-title>,&#x201D; <source>arXiv Preprint</source>, pp. <fpage>1</fpage>&#x2013;<lpage>7</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Misherghi</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Su</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Glondu</surname></string-name></person-group>, &#x201C;<article-title>DECKARD: Scalable and accurate tree-based detection of code clones</article-title>,&#x201D; in <conf-name>29th Int. Conf. on Software Engineering (ICSE&#x2019;07)</conf-name>, <conf-loc>Minneapolis, MN</conf-loc>, pp. <fpage>96</fpage>&#x2013;<lpage>105</lpage>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Dura&#x010D;&#x00ED;k</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Kr&#x0161;&#x00E1;k</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Hrk&#x00FA;t</surname></string-name></person-group>, &#x201C;<article-title>Current trends in source code analysis, plagiarism detection and issues of analysis Big datasets</article-title>,&#x201D; <source>Procedia Engineering</source>, vol. <volume>192</volume>, pp. <fpage>136</fpage>&#x2013;<lpage>141</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Tekchandani</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Bhatia</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Singh</surname></string-name></person-group>, &#x201C;<article-title>Semantic code clone detection for internet of things applications using reaching definition and liveness analysis</article-title>,&#x201D; <source>Journal of Supercomputing</source>, vol. <volume>74</volume>, no. <issue>9</issue>, pp. <fpage>4199</fpage>&#x2013;<lpage>4226</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Keivanloo</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Zou</surname></string-name></person-group>, &#x201C;<article-title>Threshold-free code clone detection for a large-scale heterogeneous Java repository</article-title>,&#x201D; in <conf-name>22nd Int. Conf. on Software Analysis, Evolution and Reengineering (SANER)</conf-name>, <conf-loc>Montreal, QC, Canada</conf-loc>, pp. <fpage>201</fpage>&#x2013;<lpage>210</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>I. G.</given-names> <surname>Anil</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Reddy</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Govardhan</surname></string-name></person-group>, &#x201C;<article-title>Software code clone detection using AST</article-title>,&#x201D; <source>International Journal of P2P Network Trends and Technology</source>, vol. <volume>4</volume>, no. <issue>3</issue>, pp. <fpage>33</fpage>&#x2013;<lpage>39</lpage>, <month>June</month> <year>2014</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. W.</given-names> <surname>Son</surname></string-name>, <string-name><given-names>T. G.</given-names> <surname>Noh</surname></string-name>, <string-name><given-names>H. J.</given-names> <surname>Song</surname></string-name> and <string-name><given-names>S. B.</given-names> <surname>Park</surname></string-name></person-group>, &#x201C;<article-title>An application for plagiarized source code detection based on a parse tree kernel</article-title>,&#x201D; <source>Engineering Applications of Artificial Intelligence</source>, vol. <volume>26</volume>, no. <issue>8</issue>, pp. <fpage>1911</fpage>&#x2013;<lpage>1918</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. A.</given-names> <surname>Nishi</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Damevski</surname></string-name></person-group>, &#x201C;<article-title>Scalable code clone detection and search based on adaptive prefix filtering</article-title>,&#x201D; <source>Journal of Systems and Software</source>, vol. <volume>137</volume>, pp. <fpage>130</fpage>&#x2013;<lpage>142</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>Mi</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Keung</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Mensah</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Gao</surname></string-name></person-group>, &#x201C;<article-title>Improving code readability classification using convolutional neural networks</article-title>,&#x201D; <source>Information and Software Technology</source>, vol. <volume>104</volume>, pp. <fpage>60</fpage>&#x2013;<lpage>71</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Saini</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Sajnani</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Kim</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Lopes</surname></string-name></person-group>, &#x201C;<article-title>SourcererCC and SourcererCC-i: Tools to detect clones in batch mode and during software development</article-title>,&#x201D; in <conf-name>38th Int. Conf. on Software Engineering Companion (ICSE-C)</conf-name>, <conf-loc>Austin, TX, USA</conf-loc>, pp. <fpage>597</fpage>&#x2013;<lpage>600</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Svajlenko</surname></string-name> and <string-name><given-names>C. K.</given-names> <surname>Roy</surname></string-name></person-group>, &#x201C;<article-title>Fast and flexible large-scale clone detection with CloneWorks</article-title>,&#x201D; in <conf-name>IEEE/ACM 39th Int. Conf. on Software Engineering Companion</conf-name>, <conf-loc>Buenos Aires, Argentina</conf-loc>, pp. <fpage>27</fpage>&#x2013;<lpage>30</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. K.</given-names> <surname>Roy</surname></string-name> and <string-name><given-names>J. R.</given-names> <surname>Cordy</surname></string-name></person-group>, &#x201C;<article-title>NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization</article-title>,&#x201D; in <conf-name>IEEE Int. Conf. on Program Comprehension</conf-name>, <conf-loc>Amsterdam, Netherlands</conf-loc>, pp. <fpage>172</fpage>&#x2013;<lpage>181</lpage>, <year>2008</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Jiang</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Su</surname></string-name></person-group>, &#x201C;<article-title>Automatic mining of functionally equivalent code fragments via random testing</article-title>,&#x201D; in <conf-name>18th Int. Symposium on Software Testing and Analysis</conf-name>, <conf-loc>Amsterdam</conf-loc>, pp. <fpage>81</fpage>&#x2013;<lpage>91</lpage>, <year>2009</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Gabel</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Jiang</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Su</surname></string-name></person-group>, &#x201C;<article-title>Scalable detection of semantic clones</article-title>,&#x201D; in <conf-name>Proc. of the 30th Int. Conf. on Software Engineering</conf-name>, <conf-loc>Leipzig, Germany</conf-loc>, pp. <fpage>321</fpage>&#x2013;<lpage>330</lpage>, <year>2008</year>. </mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H. H.</given-names> <surname>Wei</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code</article-title>,&#x201D; in <conf-name>Int. Joint Conf. on Artificial Intelligence (IJCAI-17)</conf-name>, <conf-loc>Melbourne, Australia</conf-loc>, pp. <fpage>3034</fpage>&#x2013;<lpage>3040</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Dumais</surname></string-name></person-group>, &#x201C;<article-title>Latent semantic indexing (LSI) and TREC-2</article-title>,&#x201D; <source>NIST Special Publication SP</source>, pp. <fpage>105</fpage>&#x2013;<lpage>105</lpage>, <year>1994</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Myagmar</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Zhou</surname></string-name></person-group>, &#x201C;<article-title>CP-Miner: Finding copy-paste and related bugs in large-scale software code</article-title>,&#x201D; <source>IEEE Transactions on Software Engineering</source>, vol. <volume>32</volume>, no. <issue>3</issue>, pp. <fpage>176</fpage>&#x2013;<lpage>192</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>I. D.</given-names> <surname>Baxter</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Yahin</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Moura</surname></string-name>, <string-name><given-names>M. S.</given-names> <surname>Anna</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Bier</surname></string-name></person-group>, &#x201C;<article-title>Clone detection using abstract syntax suffix trees</article-title>,&#x201D; in <conf-name>13th Working Conf. on Reverse Engineering</conf-name>, <conf-loc>Benevento, Italy</conf-loc>, pp. <fpage>253</fpage>&#x2013;<lpage>262</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Datar</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Indyk</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Immorlica</surname></string-name> and <string-name><given-names>V. S.</given-names> <surname>Mirrokni</surname></string-name></person-group>, &#x201C;<article-title>Locality-sensitive hashing scheme based on p-stable distributions</article-title>,&#x201D; in <conf-name>Twentieth Annual Symposium on Computational Geometry</conf-name>, <conf-loc>Brooklyn, New York, USA</conf-loc>, pp. <fpage>253</fpage>&#x2013;<lpage>262</lpage>, <year>2004</year>. </mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Gionis</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Indyk</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Motwani</surname></string-name></person-group>, &#x201C;<article-title>Similarity search in high dimensions via hashing</article-title>,&#x201D; in <conf-name>25th Int. Conf. on Very Large Data Bases</conf-name>, <conf-loc>Edinburgh, Scotland</conf-loc>, pp. <fpage>518</fpage>&#x2013;<lpage>529</lpage>, <year>1999</year>. </mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Lavoie</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Eilers-Smith</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Merlo</surname></string-name></person-group>, &#x201C;<article-title>Challenging cloning related problems with GPU-based algorithms</article-title>,&#x201D; in <conf-name>Int. Workshop on Software Clones</conf-name>, <conf-loc>Cape Town, South Africa</conf-loc>, pp. <fpage>25</fpage>&#x2013;<lpage>32</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>S. J.</given-names> <surname>Turner</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>B. P.</given-names> <surname>Gan</surname></string-name> and <string-name><given-names>M. Y. H.</given-names> <surname>Low</surname></string-name></person-group>, &#x201C;<article-title>Algorithms for HLA-based distributed simulation cloning</article-title>,&#x201D; <source>ACM Transactions on Modeling and Computer Simulation</source>, vol. <volume>15</volume>, no. <issue>4</issue>, pp. <fpage>316</fpage>&#x2013;<lpage>345</lpage>, <year>2005</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Juergens</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Hummel</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Deissenboeck</surname></string-name></person-group>, &#x201C;<article-title>CloneDetective - a workbench for clone detection research</article-title>,&#x201D; in <conf-name>Int. Conf. on Software Engineering</conf-name>, <conf-loc>Vancouver, BC, Canada</conf-loc>, pp. <fpage>603</fpage>&#x2013;<lpage>606</lpage>, <year>2009</year>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Lattner</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Lenharth</surname></string-name> and <string-name><given-names>V.</given-names> <surname>Adve</surname></string-name></person-group>, &#x201C;<article-title>Making context-sensitive points-to analysis with heap cloning practical for the real world</article-title>,&#x201D; <source>ACM SIGPLAN Notices</source>, vol. <volume>42</volume>, no. <issue>6</issue>, pp. <fpage>278</fpage>&#x2013;<lpage>289</lpage>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Rahman</surname></string-name></person-group>, &#x201C;<article-title>Clone detection on large scale codebases</article-title>,&#x201D; in <conf-name>Int. Workshop on Software Clones</conf-name>, <conf-loc>Ontario, Canada</conf-loc>, pp. <fpage>38</fpage>&#x2013;<lpage>44</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Shao</surname></string-name></person-group>, &#x201C;<article-title>Detecting duplications in sequence diagrams based on suffix trees</article-title>,&#x201D; in <conf-name>Asia Pacific Software Engineering Conf.</conf-name>, <conf-loc>Bangalore, India</conf-loc>, pp. <fpage>269</fpage>&#x2013;<lpage>276</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Kamiya</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Kusumoto</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Inoue</surname></string-name></person-group>, &#x201C;<article-title>CCFinder: A multilinguistic token-based code clone detection system for large scale source code</article-title>,&#x201D; <source>IEEE Transactions on Software Engineering</source>, vol. <volume>28</volume>, no. <issue>7</issue>, pp. <fpage>654</fpage>&#x2013;<lpage>670</lpage>, <year>2002</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>