
OCD

OCD Framework: A Comparison of Code Similarity Analysers

Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark

CREST centre, Department of Computer Science, University College London

Additional material to:

Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark: A Comparison of Code Similarity Analysers. Empirical Software Engineering, 23, 2464–2519 (2018). DOI: 10.1007/s10664-017-9564-7

As an extension to:

Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark: Similarity of Source Code in the Presence of Pervasive Modifications. 16th International Working Conference on Source Code Analysis and Manipulation (SCAM), Raleigh, NC, USA, October 2016.

Download the OCD Framework


Abstract

Source code analysis to detect code cloning, code plagiarism, and code reuse suffers from the problem of code modifications. We are interested in two types of code modifications in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using rank-based measures, and (5) local+global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, while in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. Code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.

Contents

  1. The Experimental Framework
  2. Tools and Techniques
  3. Scenario 1: Pervasive Modifications (data set)
  4. Scenario 2: Decompilation (data set)
  5. Scenario 3: Reused Boiler-Plate Code (data set)
  6. Scenario 4: Ranked Results
  7. Scenario 5: Local + Global Modifications (data set)
  8. Results
  9. Research Question 1 (Performance comparison)
  10. Research Question 2 (Optimal Configurations)
  11. Research Question 3 (Normalisation by Decompilation)
  12. Research Question 4 (Reuse of Configurations)
  13. Research Question 5 (Ranked Results)
  14. Research Question 6 (Local + Global Code Modifications)

The Experimental Framework

The general framework of our study, as shown below, consists of five main steps. In Step 1, we collect test data consisting of Java source code files. Next, the source files are transformed by applying pervasive modifications at the source and bytecode level. In the third step, all original and transformed source files are normalised. A simple form of normalisation, commonly used in similarity or clone detection, is pretty-printing the source files; we also use decompilation. In Step 4, the similarity detection tools are executed pairwise against the set of all normalised files, producing similarity reports for every pair. In the last step, the similarity reports are analysed.

Experimental Framework
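The pairwise execution in Step 4 can be pictured with a short sketch. This is only an illustration, not part of the framework's actual scripts; the corpus directory and the `similarity` function are placeholders for whichever of the 30 analysers is being run.

```python
import itertools
from pathlib import Path

def similarity(a: str, b: str) -> float:
    """Placeholder for any of the analysers; returns a similarity score, e.g. in [0, 100]."""
    raise NotImplementedError

def pairwise_reports(corpus_dir: str) -> dict:
    """Step 4: run a detector on every ordered pair of normalised files."""
    files = sorted(Path(corpus_dir).glob("*.java"))
    reports = {}
    # ordered pairs, including self-pairs; Sim(x, y) and Sim(y, x) may differ
    for x, y in itertools.product(files, repeat=2):
        reports[(x.name, y.name)] = similarity(x.read_text(), y.read_text())
    return reports
```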

Tools and Techniques

Several tools and techniques were used in this study. They fall into three categories: obfuscators, decompilers, and detectors. The tool set includes a source code obfuscator, a bytecode obfuscator, and two decompilers. The detectors cover a wide range of similarity measurement techniques, including plagiarism and clone detection, compression distance, string matching, and information retrieval. All tools are open source in order to support the repeatability of our experiments.

Obfuscators, Decompilers

Type Tool Link
Source code obfuscator Artifice Download
Bytecode obfuscator ProGuard Download
Decompiler Krakatau Download
Decompiler Procyon Download

Similarity Detection Tools

Tool/technique Similarity calculation Link
Clone detectors    
ccfx tokens and suffix tree matching Download
deckard characteristic vectors of AST optimised by LSH Download
iclones tokens and generalised suffix tree Download
nicad TXL and string comparison (LCS) Download
simian line-based string comparison Download
Plagiarism detectors    
jplag-java tokens, Karp Rabin matching, Greedy String Tiling Download
jplag-text tokens, Karp Rabin matching, Greedy String Tiling Download
plaggie N/A (not disclosed) Download
sherlock digital signatures Download
simjava tokens and string alignment Download
simtext tokens and string alignment Download
Compression (see the NCD sketch after this table)
7zncd NCD with 7z  
bzip2ncd NCD with bzip2  
gzipncd NCD with gzip  
xzncd NCD with xz  
icd ICD  
ncd ncd tool with bzlib & zlib Download
Others    
bsdiff bsdiff  
diff diff  
py-difflib Gestalt pattern matching Download
py-fuzzywuzzy fuzzy string matching Download
py-jellyfish approximate and phonetic matching of strings Download
py-ngram fuzzy search based on n-grams Download
py-sklearn cosine similarity from machine learning library Download
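The compression-based entries above (7zncd, bzip2ncd, gzipncd, xzncd, ncd) all build on the normalised compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size. A minimal sketch using Python's standard compressors is shown below; the listed tools may differ in compressor settings and pre-processing, and the conversion to a 0-100 similarity score is an assumption for illustration.

```python
import bz2, lzma, zlib

def ncd(x: bytes, y: bytes, compress=bz2.compress) -> float:
    """Normalised compression distance: ~0 for near-identical inputs, ~1 for unrelated ones."""
    cx, cy, cxy = len(compress(x)), len(compress(y)), len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def ncd_similarity(x: bytes, y: bytes, compress=bz2.compress) -> float:
    """One possible similarity score in [0, 100] derived from NCD (illustrative only)."""
    return 100.0 * (1.0 - ncd(x, y, compress))

# Example (hypothetical file names): a bzip2ncd-style score for two source files.
# sim = ncd_similarity(open("A.java", "rb").read(), open("B.java", "rb").read())
# Swap in zlib.compress or lzma.compress for gzip- or xz-style variants.
```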

Scenario 1

Pervasive Modifications

The data set in this scenario is created to simulate pervasive modifications made to source code, using a source code obfuscator (Artifice) and a bytecode obfuscator (ProGuard). The total number of pervasively modified source code files is 100. The process of code transformation is displayed below. The data set is called OCD (Obfuscation/Compilation/Decompilation).

Scenario 1

Data set: Obfuscation/Compilation/Decompilation (OCD)

Scenario 2

Decompilation

In this scenario, the data set is based on the same set of 100 source code files generated in Scenario 1. However, we added normalisation through decompilation to the post-processing (step 3 in the framework) by compiling all the transformed files using javac and decompiling them using either Krakatau or Procyon.

Data set: OCDdecomp (krakatau)
Data set: OCDdecomp (procyon)

Scenario 3

Reused Boiler-Plate Code

In this scenario, we want to analyse the tools’ performance on a data set that contains files in which fragments of (boiler-plate) code are reused, with or without modifications. We chose the training set of the SOCO (SOurce COde re-use) data set, which was provided in a competition for discovering monolingual re-used source code amongst a given set of programs. The files in the SOCO data set are adapted from the study by Arwin et al. (2006). We found that many of them share the same or very similar boiler-plate code fragments that perform the same task. Some of the boiler-plate fragments have been modified to adapt them to the environment in which the fragment is re-used.

Data set: the SOCO data set can be obtained from the competition website (http://users.dsic.upv.es/grupos/nle/soco/).

Update: the original website is now not accessible. You can download a copy of the (pretty-printed) SOCO data set here.

Fixed answer key: the answer key for the SOCO Java training data set, containing 97 pairs of reused code (download)

Scenario 4

Ranked Results

In our three previous scenarios (Ragkhitwetsagul et al., 2016), we compared the tools’ performances using their optimal F-scores. The F-score is the weighted harmonic mean of precision and recall. It is a set-based measure that does not consider any ordering of the results. Instead of looking at the results as a set and applying a cut-off threshold to obtain true and false positives, we consider only a subset of the results based on their rankings. We adopt three error measures mainly used in information retrieval, precision-at-n (prec@n), average r-precision (ARP), and mean average precision (MAP), to measure the tools’ performances.

Further reading: Manning et al., Introduction to Information Retrieval, Cambridge University Press. 2008.
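A minimal sketch of the three ranked measures under their usual information-retrieval definitions is shown below. It is illustrative only; the exact choice of n, the query sets, and the ground-truth handling used in our scripts may differ.

```python
def precision_at_n(ranked: list, relevant: set, n: int) -> float:
    """Fraction of the top-n ranked items that are relevant (true positives)."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def r_precision(ranked: list, relevant: set) -> float:
    """Precision at rank R, where R is the number of relevant items for this query."""
    r = len(relevant)
    return precision_at_n(ranked, relevant, r) if r else 0.0

def average_precision(ranked: list, relevant: set) -> float:
    """Mean of the precision values at each rank where a relevant item is retrieved."""
    hits, precisions = 0, []
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# ARP is the mean of r_precision over all queries;
# MAP is the mean of average_precision over all queries.
```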

Scenario 5

Local + Global Modifications

We have two objectives for this experimental scenario. First, we are interested in a situation where local and global code modifications are combined. Second, we shift our focus from measuring how well our tools find all similar pairs of pervasively modified code, as we did in Scenario 1, to measuring how well they find similar pairs of code for each pervasive code modification type. The table below presents the 10 pervasive code modification types, including the original (O); source code obfuscation by Artifice (A); decompilation by Krakatau (K); decompilation by Procyon (Pc); bytecode obfuscation by ProGuard followed by decompilation with Krakatau (PgK) or Procyon (PgPc); and four further combinations (AK, APc, APgK, APgPc); along with the ground truth for each of them. The numbers of code pairs and true positive pairs for A to APgPc are twice as large as for the original (O) type because similarity is asymmetric, i.e. Sim(x,y) and Sim(y,x) are counted separately.

Type | Modification | Source obf. | Bytecode obf. | Decomp. | Pairs | TP
O | Original | | | | 1,089 | 55
A | Artifice | X | | | 2,178 | 110
K | Krakatau | | | X | 2,178 | 110
Pc | Procyon | | | X | 2,178 | 110
PgK | ProGuard + Krakatau | | X | X | 2,178 | 110
PgPc | ProGuard + Procyon | | X | X | 2,178 | 110
AK | Artifice + Krakatau | X | | X | 2,178 | 110
APc | Artifice + Procyon | X | | X | 2,178 | 110
APgK | Artifice + ProGuard + Krakatau | X | X | X | 2,178 | 110
APgPc | Artifice + ProGuard + Procyon | X | X | X | 2,178 | 110

We measured the tools’ performance on each Sim_M(F) set separately: by applying the tools to pairs of original and pervasively modified code, we measure them on one particular type of code modification at a time.

Data set: SOCOgenerated

Results

RQ1

Performance Comparison

How well do current similarity detection techniques perform in the presence of pervasive source code modifications and boiler-plate code?

Pervasively modified code (OCD data set)

The table below shows the performance of all the tools and their optimal configurations. Using the F-score, ccfx is the winner with the highest value of 0.9760.

ToolSettingsThresholdFPFNAccPrecRecAUCF-scoreRanking
Clone detectors
ccfx (C)*b=5, t=113624240.99520.9760.9760.99950.9761
deckard (T)*mintoken=3017442270.97290.94610.7730.95850.85096
 stride=2         
 similarity=0.95         
iclones (L)*minblock=100363580.91960.90480.48860.70880.634527
 minclone=50         
nicad (L)*UPI=0.5038383460.96160.94510.6540.81640.77323
 minline=8         
 rename=blind         
 abstract=literal         
simian (C)*threshold=451501650.96850.84770.8350.92620.84139
 ignoreVariableNames         
Plagiarism detectors
jplag-javat=719581960.97460.93270.8040.95630.86363
jplag-textt=414662390.96950.92020.7610.96580.833112
plaggieM=819832340.96830.90220.7660.95460.828615
sherlockN=4, Z=261421960.96620.84990.8040.94470.826317
simjavar=16151201520.97280.8760.8480.97110.86185
simtextr=414384220.9540.93830.5780.80750.715325
Compression tools
7zncd-BZip2mx=1,3,545642440.96920.9220.7560.95570.830814
7zncd-Deflatemx=7381222150.96630.86550.7850.94540.823320
7zncd-Deflate64mx=7,9381232150.96620.86450.7850.94530.822921
7zncd-LZMAmx=7,9411152130.96720.87250.7870.94830.827516
7zncd-LZMA2mx=7,9411182130.96690.86960.7870.94820.826218
7zncd-PPMdmx=9421401980.96620.85140.8020.94670.82619
bzip2ncdC=1..938622160.97220.92670.7840.96350.84947
gzipncdC=7311102030.96870.87870.7970.95560.835911
icdma=LZMA250863560.95580.88220.6440.92650.744524
 mx=7,9         
ncd-zlibN/A301042070.96890.88410.7930.95840.836110
ncd-bzlibN/A37822060.97120.90640.7940.96360.84658
xzncd-e391202030.96770.86910.7970.95160.831513
Other similarity analysers
bsdiff*N/A711995770.92240.68010.4230.85620.521630
% diff (L)*N/A1881080.11820.10120.9920.46750.1837 
% diff (T)*N/A265281340.33380.11710.8660.43530.2063 
diff (C)*N/A86261840.9190.56590.8160.93640.668326
py-difflibwhitespace=false28122320.97560.98460.7680.94120.86294
 autojunk=false         
py-fuzzywuzzytoken_set_ratio85581760.97660.93420.8240.97720.87572
py-jellyfishjaro_distance783404780.91820.60560.5220.86190.560729
py-ngramN/A491102240.96660.87580.7760.9410.822922
py-sklearnN/A482924580.9250.64990.5420.91130.591128
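For reference, the pair-based columns in the table above can be reproduced from the FP and FN counts. From the reported figures for the OCD data set we infer 10,000 file pairs of which 1,000 are true positive pairs; treat these totals as an assumption of this sketch.

```python
def pair_based_measures(fp: int, fn: int, positives: int = 1000, total: int = 10000):
    """Derive accuracy, precision, recall and F-score from error counts."""
    tp = positives - fn
    tn = total - positives - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / total
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# ccfx row: FP = 24, FN = 24 reproduces Acc 0.9952, Prec 0.976, Rec 0.976, F-score 0.976.
print(pair_based_measures(24, 24))
```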

Boiler-plate code (SOCO data set)

The table below shows the performance of all the tools and their optimal configurations. Using the F-score, jplag-text is the winner with the highest value of 0.9692.

ToolSettingsThresholdFPFNAccPrecRecAUCF-scoreRanking
Clone detectors
ccfx (C)*b=15,16,17
t=12
2542150.99920.91250.96690.99050.93897
deckard (T)*mintoken=501927170.99930.94170.96250.98230.95205
stride=2
similarity=1.00
iclones (L)*minblock=401920570.99890.95190.87420.94690.911412
minclone=50
nicad (L)*UPI=0.302219510.99900.95490.88740.96940.91999
minline=5
rename=consistent
abstract=condition
simian (L)*threshold=42620170.99940.95610.96250.99210.95933
ignoreVariableNames
Plag. det.
jplag-javat=122926130.99940.94420.97130.9895 0.95764
jplag-textt=93216120.99960.96500.97350.99390.96921
plaggieM=143336370.99890.92040.91830.97530.919310
sherlockN=5, Z=02222540.99890.94770.88080.99960.913011
simjavar=254618110.99960.96070.97570.99870.96822
simtextr=121773190.99860.85600.95810.98870.904213
Compression
7zncd-BZip2mx=1,3,564241180.99790.93310.73950.99010.825126
7zncd-Deflatemx=76427970.99820.92950.78590.99370.851724
7zncd-Deflate64mx=76427960.99820.92970.78810.99570.853023
7zncd-LZMAmx=7,96911990.99840.96990.78150.99400.865520
7zncd-LZMA2mx=7,96911990.99840.96990.78150.99390.865520
7zncd-PPMdmx=968191060.99810.94810.76600.99480.847425
bzip2ncdC=1,2,3,..,8,95420940.99830.94730.79250.99440.863022
gzipncdC=95425820.99840.93690.81900.99610.874016
icdma=LZMA
mx=1,3
84121510.99760.96180.66670.97360.787527
ncd-zlibN/A5710910.99850.97310.79910.99830.877614
ncd-bzlibN/A5230820.99830.92520.81900.99430.868918
xzncd2,36413940.99840.96510.79250.99420.870317
6,7,8,9,e65
Others
bsdiffN/A9021252120.96520.10190.53200.91610.171029
diff (C)N/A29774550.88450.05470.98900.91800.103630
difflibautojunk=true4230210.99920.93510.95360.99990.94436
whitespace=true
fuzzywuzzyratio6530300.99910.93380.93380.99890.93388
jellyfishjaro_distance8201620.99761.00000.64240.95550.782328
ngramN/A5920840.99840.94860.81460.99670.876515
cosineN/A6850680.99820.88510.84990.99730.867119

Area Under the ROC Curve (AUC)

OCD data set: The ROC curves with their respective AUCs are displayed below. In terms of the overall performance across the whole range of T, ccfx is the best with the highest AUC of 0.9995.

SOCO data set: The ROC curves with their respective AUCs are displayed below. In terms of the overall performance across the whole range of T, difflib is the best with the highest AUC of 0.9999.
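The AUC summarises a tool's ranking quality over the whole range of similarity thresholds T, so it is computed from the raw similarity scores rather than a single cut-off. A minimal sketch using scikit-learn; the lists are illustrative placeholders, not data from the study.

```python
from sklearn.metrics import roc_auc_score

# ground_truth: 1 if the pair is a true clone/reuse pair, 0 otherwise
# scores: the similarity value a tool reports for the same pair
ground_truth = [1, 1, 0, 0, 1, 0]
scores = [92, 75, 40, 65, 88, 10]

auc = roc_auc_score(ground_truth, scores)  # 1.0 = perfect ranking, 0.5 = random
print(auc)
```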

RQ2

Optimal Configurations

What are the best parameter settings and similarity thresholds for the techniques?

The optimal configurations of the tools can be found in the table above (second and third columns). Specifically, we inspected ccfx’s configurations and found that the tool performs best around b=19, t={7,9} and b=5, t={11,12}, as depicted in the figure below.

From the scatter plot below, we can see that the default settings of ccfx, b=50, t=12 (denoted by a red X symbol), provide decent precision but very low recall. We observed that one cannot tune ccfx to obtain the highest precision without sacrificing recall.

RQ3

Normalisation by Decompilation

How much does compilation followed by decompilation as a pre-processing normalisation method improve detection results of pervasively modified code?

The results after adding compilation and decompilation for normalisation to the post-processing step (Step 3 in the framework) before performing similarity detection are shown in the figure below. We can clearly observe that decompilation by both Krakatau and Procyon boosts the F-scores of every tool in the study.

The complete results can be found in the two tables below:

Decompiled by Krakatau

ToolSettingsThresholdFPFNAccPrecRecAUCF-scoreRanking
Clone detectors
ccfx (C)*b=5, t=113624240.99520.9760.9760.99950.9761
deckard (T)*mintoken=3017442270.97290.94610.7730.95850.85096
 stride=2         
 similarity=0.95         
iclones (L)*minblock=100363580.91960.90480.48860.70880.634527
 minclone=50         
nicad (L)*UPI=0.5038383460.96160.94510.6540.81640.77323
 minline=8         
 rename=blind         
 abstract=literal         
simian (C)*threshold=451501650.96850.84770.8350.92620.84139
 ignoreVariableNames         
Plagiarism detectors
jplag-javat=719581960.97460.93270.8040.95630.86363
jplag-textt=414662390.96950.92020.7610.96580.833112
plaggieM=819832340.96830.90220.7660.95460.828615
sherlockN=4, Z=261421960.96620.84990.8040.94470.826317
simjavar=16151201520.97280.8760.8480.97110.86185
simtextr=414384220.9540.93830.5780.80750.715325
Compression tools
7zncd-BZip2mx=1,3,545642440.96920.9220.7560.95570.830814
7zncd-Deflatemx=7381222150.96630.86550.7850.94540.823320
7zncd-Deflate64mx=7,9381232150.96620.86450.7850.94530.822921
7zncd-LZMAmx=7,9411152130.96720.87250.7870.94830.827516
7zncd-LZMA2mx=7,9411182130.96690.86960.7870.94820.826218
7zncd-PPMdmx=9421401980.96620.85140.8020.94670.82619
bzip2ncdC=1..938622160.97220.92670.7840.96350.84947
gzipncdC=7311102030.96870.87870.7970.95560.835911
icdma=LZMA250863560.95580.88220.6440.92650.744524
 mx=7,9         
ncd-zlibN/A301042070.96890.88410.7930.95840.836110
ncd-bzlibN/A37822060.97120.90640.7940.96360.84658
xzncd-e391202030.96770.86910.7970.95160.831513
Other similarity analysers
bsdiff*N/A711995770.92240.68010.4230.85620.521630
% diff (L)*N/A1881080.11820.10120.9920.46750.1837 
% diff (T)*N/A265281340.33380.11710.8660.43530.2063 
diff (C)*N/A86261840.9190.56590.8160.93640.668326
py-difflibwhitespace=false28122320.97560.98460.7680.94120.86294
 autojunk=false         
py-fuzzywuzzytoken_set_ratio85581760.97660.93420.8240.97720.87572
py-jellyfishjaro_distance783404780.91820.60560.5220.86190.560729
py-ngramN/A491102240.96660.87580.7760.9410.822922
py-sklearnN/A482924580.9250.64990.5420.91130.591128

The complete list of the optimal configurations

The complete list of optimal configurations for Krakatau is shown below.

Tool | Settings | Granularity | Threshold
ccfx | b=5, t=8 | T | 50
ccfx | b=5, t=11 | T | 17
deckard | mintoken=30, stride=1, similarity=0.95 | L, T, C | 29, 29, 34
deckard | mintoken=30, stride=1, similarity=1.00 | L, T, C | 10, 12, 15
deckard | mintoken=30, stride=2, similarity=0.90 | T | 54
deckard | mintoken=30, stride=2, similarity=0.95 | L, T, C | 22, 28, 32
deckard | mintoken=30, stride=inf, similarity=0.95 | L, T, C | 29, 29, 34
deckard | mintoken=30, stride=inf, similarity=1.00 | L, T, C | 10, 12, 15
deckard | mintoken=50, stride=1, similarity=0.95 | L, T, C | 23, 29, 31
deckard | mintoken=50, stride=2, similarity=0.95 | L, T, C | 21, 18, 23
deckard | mintoken=50, stride=inf, similarity=0.95 | L, T, C | 23, 28, 31
simian | threshold={3,4}, ignoreidentifiers | T | 17
simian | threshold=3, ignoreidentifiers | C |
simian | threshold=5, ignoreidentifiers | L | 12
simian | threshold=5, ignoreidentifiers | T | 13
simjava | r=18,19 | | 17
simjava | r=20 | | 16
simjava | r=26,27 | | 11
simjava | r=28 | | 9
simjava | r=default | | 27
simtext | r=4 | | 33
simtext | r=5 | | 31
icd | ma=LZMA, mx={7,9} | | 54
icd | ma=LZMA2, mx={7,9} | |

Decompiled by Procyon

ToolSettingsThresholdFPFNAccPrecRecAUCF-scoreRanking
Clone detectors
ccfx* (L)b=20 t=1..7114380.99580.99590.9620.9970.97864
deckard* (T)mintoken=30100320.996810.9680.99780.98372
 stride=1 inf         
 similarity=1.00         
iclones* (C)minblock=10018980.98840.98040.9020.95080.939611
 minclone=50         
nicad* (W)UPI=0.3011161000.98840.98250.90.95360.939412
 minline=10         
 rename=blind         
 abstract=condition, literal         
simian* (C)threshold=3238700.99220.99150.930.99870.95988
 ignoreIdentifiers         
Plagiarism detectors
jplag-javat=8220720.992810.9280.98870.96277
jplag-textt=91116480.99360.98350.9520.99820.96756
plaggieM=13 141016800.99040.98290.920.97730.95049
sherlockN=1 Z=05528160.99560.97230.9840.99970.97815
simjavar=default11800.99920.992110.99990.9961
simtextr=415421000.98580.95540.90.96860.926914
 r=default0        
Compression tools
7zncd-BZip2mx=1 3 551301160.98540.96720.8840.99090.923716
7zncd-Deflatemx=949251540.98210.97130.8460.98270.904320
7zncd-Deflate64mx=949251540.98210.97130.8460.98270.904320
7zncd-LZMAmx=7 952161640.9820.98120.8360.98430.902823
7zncd-LZMA2mx=7 952171640.98190.98010.8360.98410.902324
7zncd-PPMdmx=953221220.98560.97560.8780.98610.924215
bzip2ncdC=1..9 default47121400.98480.98620.860.99220.918818
gzipncdC=336401330.98270.95590.8670.98460.909325
icdma=LZMA, mx=7 954371500.98130.95830.850.97210.9009 
 ma=LZMA2, mx=7 9         
ncd-zlibN/A41301580.98120.96560.8420.98760.899626
ncd-bzlibN/A4781400.98520.99080.860.99230.920817
xzncd-e49351480.98170.96050.8520.9860.90322
Other similarity analysers
bsdiffN/A73482360.97160.94090.7640.96060.843329
diff (C)N/A2362440.9750.99210.7560.98260.858128
py-difflibautojunk=true2612940.98940.98690.9060.97880.944710
py-fuzzywuzzytoken_set_ratio900360.996410.9640.99920.98173
py-jellyfishjaro_winkler87842700.96460.89680.730.92180.804930
py-ngramN/A5881920.980.99020.8080.97140.889927
py-sklearnN/A6954740.98720.94490.9260.98970.935412

Statistical Tests

To strengthen the findings, we performed a statistical test to see whether the performance before and after normalisation via decompilation differs with statistical significance. We chose the non-parametric two-tailed Wilcoxon signed-rank test (we also tried the randomisation, i.e. permutation, test on the results and found identical test results in all cases) and performed the test with a confidence level of 95% (i.e. alpha <= 0.05). The table below shows that the observed F-scores before and after decompilation differ with statistical significance for both Krakatau and Procyon. We complement the statistical test with a non-parametric effect size measure, Vargha and Delaney’s A measure, to quantify the level of difference between two populations. Vargha and Delaney define three levels of difference, small, medium, and large, for A values over 0.56, 0.64, and 0.71, respectively. Using this scale, our F-score differences between before and after decompilation by Krakatau (A=0.031) and Procyon (A=0.063) are small.

Test | p-value | Significant? | Vargha-Delaney's A
Before vs. after decompilation by Krakatau | 1.863e-09 | Yes | 0.031
Before vs. after decompilation by Procyon | 1.863e-09 | Yes | 0.063
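A sketch of the two statistics, using SciPy for the Wilcoxon signed-rank test and a direct implementation of Vargha and Delaney's A; the two F-score lists are placeholders, not the study's data.

```python
from scipy.stats import wilcoxon

def vargha_delaney_a(xs: list, ys: list) -> float:
    """A = P(X > Y) + 0.5 * P(X == Y); 0.5 means no difference between the samples."""
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))

f_before = [0.85, 0.80, 0.78, 0.90]  # placeholder F-scores before decompilation
f_after = [0.95, 0.93, 0.91, 0.97]   # placeholder F-scores after decompilation

stat, p = wilcoxon(f_before, f_after)      # paired, two-tailed by default
a12 = vargha_delaney_a(f_before, f_after)  # values far below 0.5 mean 'after' tends to be higher
print(p, a12)
```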


RQ4

Reuse of Configurations

Can we reuse optimal configurations from one data set in another data set effectively?

For the 10 highest-ranking tools from RQ1, we applied the optimal configurations derived from the generated data set (denoted as C*generated) to the SOCO data set. The table below shows that using these configurations has a detrimental impact on the similarity detection results for another data set, even for tools that have no parameters (e.g. ncd-zlib and ncd-bzlib) and are only influenced by the chosen similarity threshold.

Tool | OCD settings | T | OCD F-score | SOCO F-score (OCD config) | SOCO settings | T | SOCO F-score
ccfx (C) | b=5, t=11 | 36 | 0.976 | 0.8441 | b=15,16,17, t=12 | 25 | 0.9389
py-fuzzywuzzy | token_set_ratio | 85 | 0.8757 | 0.6012 | ratio | 65 | 0.9338
jplag-java | t=7 | 19 | 0.8636 | 0.3168 | t=12 | 29 | 0.9576
py-difflib | autojunk=false, whitespace=false | 28 | 0.8629 | 0.2113 | autojunk=true, whitespace=true | 42 | 0.9443
simjava | r=16 | 15 | 0.8618 | 0.5888 | r=25 | 23 | 0.9259
deckard (T) | mintoken=30, stride=2, similarity=0.95 | 17 | 0.8509 | 0.3305 | mintoken=50, stride=1, similarity=1.0 | 19 | 0.952
bzip2ncd | C=1..9 | 38 | 0.8494 | 0.3661 | C=1..9 | 54 | 0.863
ncd-bzlib | N/A | 37 | 0.8465 | 0.3357 | N/A | 52 | 0.8689
simian (C) | threshold=4, ignoreVariableNames | 5 | 0.8413 | 0.6394 | threshold=4, ignoreVariableNames | 26 | 0.9593
ncd-zlib | N/A | 30 | 0.8361 | 0.3454 | N/A | 57 | 0.8776

RQ5

Ranked Results

Which tools perform best when only the top n results are retrieved?

We applied three additional error measures adopted from information retrieval, precision-at-n (prec@n), average r-precision (ARP), and mean average precision (MAP), to the OCD and SOCO data sets. The results are shown below. The complete settings can be found in the CSV files below.

Ranked results: generated data set

Ranking | F-score (pair-based) | prec@n (pair-based) | ARP (query-based) | MAP (query-based)
1 | (0.976) ccfx | (0.976) ccfx | (1.000) ccfx | (1.000) ccfx
2 | (0.876) fuzzywuzzy | (0.860) simjava | (0.915) fuzzywuzzy | (0.949) fuzzywuzzy
3 | (0.864) jplag-java | (0.858) fuzzywuzzy | (0.913) ncd-bzlib | (0.943) ncd-bzlib
4 | (0.863) difflib | (0.842) simian | (0.912) 7zncd-BZip2 | (0.942) bzip2ncd
5 | (0.862) simjava | (0.836) deckard | (0.909) bzip2ncd | (0.938) 7zncd-BZip2
6 | (0.851) deckard | (0.836) jplag-java | (0.900) 7zncd-PPMd | (0.937) gzipncd
7 | (0.849) bzip2ncd | (0.832) bzip2ncd | (0.900) gzipncd | (0.935) ncd-zlib
8 | (0.847) ncd-bzlib | (0.828) difflib | (0.898) ncd-zlib | (0.933) jplag-text
9 | (0.841) simian | (0.826) ncd-bzlib | (0.898) xzncd | (0.930) 7zncd-PPMd
10 | (0.836) ncd-zlib | (0.820) 7zncd-BZip2 | (0.895) 7zncd-LZMA2 | (0.929) xzncd

Ranked results: SOCO data set

Ranking | F-score (pair-based) | prec@n (pair-based) | ARP (query-based) | MAP (query-based)
1 | (0.969) jplag-text | (0.965) jplag-text | (0.998) jplag-java | (0.997) jplag-java
2 | (0.968) simjava | (0.960) simjava | (0.998) difflib | (0.997) difflib
3 | (0.959) simian | (0.956) simian | (0.989) ccfx | (0.993) jplag-text
4 | (0.958) jplag-java | (0.947) deckard | (0.989) simjava | (0.988) simjava
5 | (0.952) deckard | (0.943) jplag-java | (0.987) gzipncd | (0.987) gzipncd
6 | (0.944) difflib | (0.938) difflib | (0.986) jplag-text | (0.987) ncd-zlib
7 | (0.939) ccfx | (0.929) ccfx | (0.985) ncd-zlib | (0.986) sherlock
8 | (0.934) fuzzywuzzy | (0.929) fuzzywuzzy | (0.984) 7zncd-Deflate | (0.986) 7zncd-Deflate64
9 | (0.920) nicad | (0.914) plaggie | (0.984) 7zncd-Deflate64 | (0.986) 7zncd-Deflate
10 | (0.919) plaggie | (0.901) nicad | (0.983) 7zncd-LZMA | (0.984) fuzzywuzzy

We also varied the value of n for precision-at-n. Plots of the tools’ performances for multiple precision-at-n values are shown below.

Multiple prec-at-n values for OCD data set

Multiple prec-at-n values for SOCO data set

Statistical Tests

Since ARP is computed from a mean, we performed a statistical test to strengthen our results by testing for statistically significant differences in the sets of r-precision values between tools. We chose a one-tailed non-parametric randomisation test (i.e. permutation test) due to its robustness for significance testing in information retrieval, as shown by Smucker et al. (2007) (we also tried a one-tailed Wilcoxon signed-rank test on the results and found identical test results in all cases). We performed the test using 100,000 random samples with a confidence level of 95% (i.e. alpha <= 0.05). The statistical test results show that, for the generated data set, ccfx is the only tool that dominates the other tools’ r-precision values with statistical significance. For the SOCO data set, jplag-java and difflib outperform 7zncd-Deflate, 7zncd-Deflate64, and 7zncd-LZMA with statistical significance.

Similarly, since MAP is also computed from a mean, we performed a one-tailed non-parametric randomisation test on pairwise comparisons of the tools’ MAP values. The test results show that, for the generated data set, ccfx dominates the other tools’ MAPs with statistical significance. For the SOCO data set, we found that jplag-java and difflib outperform gzipncd, ncd-zlib, sherlock, 7zncd-Deflate64, 7zncd-Deflate, and fuzzywuzzy with statistical significance.
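A sketch of a paired randomisation (permutation) test of the kind described above, assuming two equal-length lists of per-query values for tools A and B; the names and the sample count default are illustrative.

```python
import random

def paired_permutation_test(a: list, b: list, samples: int = 100_000, seed: int = 0) -> float:
    """One-tailed paired permutation test that tool A's values exceed tool B's on average."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    at_least_as_extreme = 0
    for _ in range(samples):
        # randomly flip the sign of each paired difference (the null hypothesis of no difference)
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / samples  # p-value; reject the null if below alpha = 0.05
```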

RQ6

Local + Global Modifications

How well do the techniques perform when source code containing boiler-plate code clones has been pervasively modified?

We evaluate the tools on a data set combining both local and global code modifications. This question also studies which types of pervasive modifications (source code obfuscation, bytecode obfuscation, compilation/decompilation) most strongly affect the tools’ performances.

Tools’ Performances vs. Individual Pervasive Modification Type

Using the results from experimental Scenario 5, we present the tools’ performances based on F-scores in the table below and show the distribution of F-scores in a boxplot. The F-scores are grouped according to the 10 pervasive code modification types. Numbers are highlighted when the F-score is higher than 0.8.

Tool | O | A | K | Pc | PgK | PgPc | AK | APc | APgK | APgPc
Clone det.
ccfx (C)* | 0.8911 | 0.3714 | 0.0000 | 0.6265 | 0.0000 | 0.1034 | 0.0000 | 0.2985 | 0.0000 | 0.1034
deckard (T)* | 0.9636 | 0.9217 | 0.1667 | 0.3333 | 0.0357 | 0.2286 | 0.1667 | 0.3252 | 0.0357 | 0.2286
iclones (L)* | 0.5000 | 0.0000 | 0.0000 | 0.0357 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
nicad (T)* | 0.5823 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
simian (L)* | 0.8350 | 0.1034 | 0.0357 | 0.1356 | 0.0000 | 0.0357 | 0.0000 | 0.0357 | 0.0000 | 0.0357
Plagiarism det.
jplag-java | 1.0000 | 1.0000 | 0.7429 | 0.9524 | 0.2973 | 0.4533 | 0.7547 | 0.9720 | 0.2973 | 0.4507
jplag-text | 0.9815 | 0.6265 | 0.5581 | 0.6304 | 0.3590 | 0.4250 | 0.4906 | 0.5581 | 0.3590 | 0.4304
plaggie | 0.9636 | 0.9159 | 0.7363 | 0.9372 | 0.2171 | 0.4626 | 0.7363 | 0.9423 | 0.2171 | 0.4626
sherlock | 0.9483 | 0.8298 | 0.7872 | 0.8298 | 0.3061 | 0.3516 | 0.6744 | 0.7826 | 0.3061 | 0.3516
simjava | 0.9649 | 0.9815 | 1.0000 | 0.7525 | 0.3188 | 0.3913 | 0.8041 | 0.7525 | 0.3188 | 0.3913
simtext | 0.9649 | 0.7191 | 0.1667 | 0.4932 | 0.0357 | 0.1667 | 0.0702 | 0.2258 | 0.0357 | 0.1667
Compression
7zncd-BZip2 | 0.9273 | 0.7736 | 0.6852 | 0.8649 | 0.2446 | 0.3704 | 0.6423 | 0.7465 | 0.2446 | 0.3704
7zncd-Deflate | 0.9483 | 0.7579 | 0.6935 | 0.8406 | 0.2427 | 0.3333 | 0.6360 | 0.7418 | 0.2427 | 0.3333
7zncd-Deflate64 | 0.9483 | 0.7579 | 0.6935 | 0.8406 | 0.2427 | 0.3333 | 0.6360 | 0.7373 | 0.2427 | 0.3333
7zncd-LZMA | 0.9649 | 0.7967 | 0.7488 | 0.8851 | 0.2663 | 0.3842 | 0.6768 | 0.7665 | 0.2632 | 0.3842
7zncd-LZMA2 | 0.9649 | 0.7934 | 0.7536 | 0.8851 | 0.2718 | 0.3923 | 0.6700 | 0.7632 | 0.2697 | 0.4000
7zncd-PPMd | 0.9623 | 0.7965 | 0.7628 | 0.8909 | 0.2581 | 0.3796 | 0.6667 | 0.8019 | 0.2581 | 0.3796
bzip2ncd | 0.9649 | 0.8305 | 0.8302 | 0.9273 | 0.3590 | 0.4681 | 0.7612 | 0.8448 | 0.3562 | 0.4681
gzipncd | 0.9623 | 0.7965 | 0.7628 | 0.8909 | 0.2581 | 0.3796 | 0.6667 | 0.8019 | 0.2581 | 0.3796
icd | 0.9216 | 0.5058 | 0.4371 | 0.5623 | 0.2237 | 0.2822 | 0.3478 | 0.4239 | 0.2237 | 0.2822
ncd-zlib | 0.9821 | 0.8571 | 0.8246 | 0.9432 | 0.4021 | 0.4920 | 0.7491 | 0.8559 | 0.3963 | 0.4920
ncd-bzlib | 0.9649 | 0.8269 | 0.8269 | 0.9273 | 0.3529 | 0.4634 | 0.7500 | 0.8448 | 0.3500 | 0.4719
xzncd | 0.9734 | 0.8416 | 0.7925 | 0.9198 | 0.3133 | 0.4615 | 0.7035 | 0.8148 | 0.3133 | 0.4615
Others
bsdiff | 0.4388 | 0.2280 | 0.1529 | 0.2005 | 0.1151 | 0.1350 | 0.1276 | 0.1596 | 0.1152 | 0.1353
diff (C) | 0.2835 | 0.2374 | 0.1585 | 0.2000 | 0.1296 | 0.1248 | 0.1530 | 0.1786 | 0.1302 | 0.1249
difflib | 0.9821 | 0.9550 | 0.8952 | 0.9565 | 0.4790 | 0.5087 | 0.8688 | 0.9381 | 0.4606 | 0.5091
fuzzywuzzy | 1.0000 | 0.9821 | 0.9259 | 0.9636 | 0.4651 | 0.5116 | 0.9074 | 0.9541 | 0.4557 | 0.5116
jellyfish | 0.9273 | 0.7253 | 0.6400 | 0.6667 | 0.2479 | 0.3579 | 0.5513 | 0.5000 | 0.2479 | 0.3662
ngram | 1.0000 | 0.9464 | 0.8952 | 0.9346 | 0.4110 | 0.4490 | 0.8785 | 0.8908 | 0.4054 | 0.4578
cosine | 0.9074 | 0.6847 | 0.7123 | 0.6800 | 0.3500 | 0.3596 | 0.5823 | 0.5287 | 0.3500 | 0.3596

Raw data download

Raw data of the results from the four experimental scenarios can be downloaded below:

Results from the generated, SOCO, and SOCOgen data sets