
OCD

OCD Framework: A Comparison of Code Similarity Analysers

Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark

CREST centre, Department of Computer Science, University College London

Additional material to:

Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark: A Comparison of Code Similarity Analysers. Empirical Software Engineering, 23, 2464–2519 (2018). DOI: 10.1007/s10664-017-9564-7

As an extension to:

Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark: Similarity of Source Code in the Presence of Pervasive Modifications. 16th International Working Conference on Source Code Analysis and Manipulation (SCAM), Raleigh, NC, USA, October 2016.

Download the OCD Framework


Abstract

Source code analysis to detect code cloning, code plagiarism, and code reuse suffers from the problem of code modifications. We are interested in two types of code modifications in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using rank-based measures, and (5) local+global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, while in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. Code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.

Contents

  1. The Experimental Framework
  2. Tools and Techniques
  3. Scenario 1: Pervasive Modifications (data set)
  4. Scenario 2: Decompilation (data set)
  5. Scenario 3: Reused Boiler-Plate Code (data set)
  6. Scenario 4: Ranked Results
  7. Scenario 5: Local + Global Modifications (data set)
  8. Results
  9. Research Question 1 (Performance comparison)
  10. Research Question 2 (Optimal Configurations)
  11. Research Question 3 (Normalisation by Decompilation)
  12. Research Question 4 (Reuse of Configurations)
  13. Research Question 5 (Ranked Results)
  14. Research Question 6 (Local + Global Code Modifications)

The Experimental Framework

The general framework of our study, as shown below, consists of five main steps. In Step 1, we collect test data consisting of Java source code files. Next, the source files are transformed by applying pervasive modifications at the source and bytecode level. In the third step, all original and transformed source files are normalised. A simple form of normalisation, commonly used in similarity or clone detection, is pretty-printing the source files; we also use decompilation. In Step 4, the similarity detection tools are executed pairwise against the set of all normalised files, producing similarity reports for every pair. In the last step, the similarity reports are analysed.

Experimental Framework
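The pairwise execution in Step 4 can be pictured with a short sketch. This is only an illustration, not part of the framework's actual scripts; the corpus directory and the `similarity` function are placeholders for whichever of the 30 analysers is being run.

```python
import itertools
from pathlib import Path

def similarity(a: str, b: str) -> float:
    """Placeholder for any of the analysers; returns a similarity score, e.g. in [0, 100]."""
    raise NotImplementedError

def pairwise_reports(corpus_dir: str) -> dict:
    """Step 4: run a detector on every ordered pair of normalised files."""
    files = sorted(Path(corpus_dir).glob("*.java"))
    reports = {}
    # ordered pairs, including self-pairs; Sim(x, y) and Sim(y, x) may differ
    for x, y in itertools.product(files, repeat=2):
        reports[(x.name, y.name)] = similarity(x.read_text(), y.read_text())
    return reports
```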

Tools and Techniques

Several tools and techniques were used in this study. They fall into three categories: obfuscators, decompilers, and detectors. The tool set includes a source code obfuscator, a bytecode obfuscator, and two decompilers. The detectors cover a wide range of similarity measurement techniques, including plagiarism and clone detection, compression distance, string matching, and information retrieval. All tools are open source in order to support the repeatability of our experiments.

Obfuscators, Decompilers

Type Tool Link
Source code obfuscator Artifice Download
Bytecode obfuscator ProGuard Download
Decompiler Krakatau Download
Decompiler Procyon Download

Similarity Detection Tools

Tool/technique Similarity calculation Link
Clone detectors    
ccfx tokens and suffix tree matching Download
deckard characteristic vectors of AST optimised by LSH Download
iclones tokens and generalised suffix tree Download
nicad TXL and string comparison (LCS) Download
simian line-based string comparison Download
Plagiarism detectors    
jplag-java tokens, Karp Rabin matching, Greedy String Tiling Download
jplag-text tokens, Karp Rabin matching, Greedy String Tiling Download
plaggie N/A (not disclosed) Download
sherlock digital signatures Download
simjava tokens and string alignment Download
simtext tokens and string alignment Download
Compression (see the NCD sketch after this table)
7zncd NCD with 7z  
bzip2ncd NCD with bzip2  
gzipncd NCD with gzip  
xzncd NCD with xz  
icd ICD  
ncd ncd tool with bzlib & zlib Download
Others    
bsdiff bsdiff  
diff diff  
py-difflib Gestalt pattern matching Download
py-fuzzywuzzy fuzzy string matching Download
py-jellyfish approximate and phonetic matching of strings Download
py-ngram fuzzy search based on n-grams Download
py-sklearn cosine similarity from machine learning library Download
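The compression-based entries above (7zncd, bzip2ncd, gzipncd, xzncd, ncd) all build on the normalised compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size. A minimal sketch using Python's standard compressors is shown below; the listed tools may differ in compressor settings and pre-processing, and the conversion to a 0-100 similarity score is an assumption for illustration.

```python
import bz2, lzma, zlib

def ncd(x: bytes, y: bytes, compress=bz2.compress) -> float:
    """Normalised compression distance: ~0 for near-identical inputs, ~1 for unrelated ones."""
    cx, cy, cxy = len(compress(x)), len(compress(y)), len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def ncd_similarity(x: bytes, y: bytes, compress=bz2.compress) -> float:
    """One possible similarity score in [0, 100] derived from NCD (illustrative only)."""
    return 100.0 * (1.0 - ncd(x, y, compress))

# Example (hypothetical file names): a bzip2ncd-style score for two source files.
# sim = ncd_similarity(open("A.java", "rb").read(), open("B.java", "rb").read())
# Swap in zlib.compress or lzma.compress for gzip- or xz-style variants.
```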

Scenario 1

Pervasive Modifications

The data set in this scenario is created to simulate pervasive modifications made to source code, using a source code obfuscator (Artifice) and a bytecode obfuscator (ProGuard). The total number of pervasively modified source code files is 100. The process of code transformation is displayed below. The data set is called OCD (Obfuscation/Compilation/Decompilation).

Scenario 1

Data set: Obfuscation/Compilation/Decompilation (OCD)

Scenario 2

Decompilation

In this scenario, the data set is based on the same set of 100 source code files generated in Scenario 1. However, we added normalisation through decompilation to the post-processing (step 3 in the framework) by compiling all the transformed files using javac and decompiling them using either Krakatau or Procyon.

Data set: OCDdecomp (krakatau)
Data set: OCDdecomp (procyon)

Scenario 3

Reused Boiler-Plate Code

In this scenario, we want to analyse the tools’ performance on a data set that contains files in which fragments of (boiler-plate) code are reused, with or without modifications. We chose the training set of the SOCO (SOurce COde re-use) data set, which was provided in a competition for discovering monolingual re-used source code amongst a given set of programs. The files in the SOCO data set are adapted from the study by Arwin et al. (2006). We found that many of them share the same or very similar boiler-plate code fragments that perform the same task. Some of the boiler-plate fragments have been modified to adapt them to the environment in which the fragment is re-used.

Data set: the SOCO data set can be obtained from the competition website (http://users.dsic.upv.es/grupos/nle/soco/).

Update: the original website is now not accessible. You can download a copy of the (pretty-printed) SOCO data set here.

Fixed answer key: the answer key for the SOCO Java training data set, containing 97 pairs of reused code (download)

Scenario 4

Ranked Results

In our three previous scenarios (Ragkhitwetsagul et al., 2016), we compared the tools’ performances using their optimal F-scores. The F-score is the weighted harmonic mean of precision and recall. It is a set-based measure that does not consider any ordering of the results. Instead of looking at the results as a set and applying a cut-off threshold to obtain true and false positives, we consider only a subset of the results based on their rankings. We adopt three error measures mainly used in information retrieval, precision-at-n (prec@n), average r-precision (ARP), and mean average precision (MAP), to measure the tools’ performances.

Further reading: Manning et al., Introduction to Information Retrieval, Cambridge University Press. 2008.
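A minimal sketch of the three ranked measures under their usual information-retrieval definitions is shown below. It is illustrative only; the exact choice of n, the query sets, and the ground-truth handling used in our scripts may differ.

```python
def precision_at_n(ranked: list, relevant: set, n: int) -> float:
    """Fraction of the top-n ranked items that are relevant (true positives)."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def r_precision(ranked: list, relevant: set) -> float:
    """Precision at rank R, where R is the number of relevant items for this query."""
    r = len(relevant)
    return precision_at_n(ranked, relevant, r) if r else 0.0

def average_precision(ranked: list, relevant: set) -> float:
    """Mean of the precision values at each rank where a relevant item is retrieved."""
    hits, precisions = 0, []
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# ARP is the mean of r_precision over all queries;
# MAP is the mean of average_precision over all queries.
```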

Scenario 5

Local + Global Modifications

We have two objectives for this experimental scenario. First, we are interested in a situation where local and global code modifications are combined. Second, we shift our focus from measuring how well our tools find all similar pairs of pervasively modified code, as we did in Scenario 1, to measuring how well they find similar pairs of code for each pervasive code modification type. The table below presents the 10 pervasive code modification types, including the original (O); source code obfuscation by Artifice (A); decompilation by Krakatau (K); decompilation by Procyon (Pc); bytecode obfuscation by ProGuard followed by decompilation with Krakatau (PgK) or Procyon (PgPc); and four further combinations (AK, APc, APgK, APgPc); along with the ground truth for each of them. The numbers of code pairs and true positive pairs for A to APgPc are twice as large as for the original (O) type because similarity is asymmetric, i.e. Sim(x,y) and Sim(y,x) are counted separately.

Type | Modification | Source obf. | Bytecode obf. | Decomp. | Pairs | TP
O | Original | | | | 1,089 | 55
A | Artifice | X | | | 2,178 | 110
K | Krakatau | | | X | 2,178 | 110
Pc | Procyon | | | X | 2,178 | 110
PgK | ProGuard + Krakatau | | X | X | 2,178 | 110
PgPc | ProGuard + Procyon | | X | X | 2,178 | 110
AK | Artifice + Krakatau | X | | X | 2,178 | 110
APc | Artifice + Procyon | X | | X | 2,178 | 110
APgK | Artifice + ProGuard + Krakatau | X | X | X | 2,178 | 110
APgPc | Artifice + ProGuard + Procyon | X | X | X | 2,178 | 110

We measured the tools’ performance on each Sim_M(F) set separately: by applying the tools to pairs of original and pervasively modified code, we measure them on one particular type of code modification at a time.

Data set: SOCOgenerated

Results

RQ1

Performance Comparison

How well do current similarity detection techniques perform in the presence of pervasive source code modifications and boiler-plate code?

Pervasively modified code (OCD data set)

The table below shows the performance of all the tools and their optimal configurations. Using the F-score, ccfx is the winner with the highest value of 0.9760.

ToolSettingsThresholdFPFNAccPrecRecAUCF-scoreRanking
Clone detectors
ccfx (C)*b=5, t=113624240.99520.9760.9760.99950.9761
deckard (T)*mintoken=3017442270.97290.94610.7730.95850.85096
 stride=2         
 similarity=0.95         
iclones (L)*minblock=100363580.91960.90480.48860.70880.634527
 minclone=50         
nicad (L)*UPI=0.5038383460.96160.94510.6540.81640.77323
 minline=8         
 rename=blind         
 abstract=literal         
simian (C)*threshold=451501650.96850.84770.8350.92620.84139
 ignoreVariableNames         
Plagiarism detectors
jplag-javat=719581960.97460.93270.8040.95630.86363
jplag-textt=414662390.96950.92020.7610.96580.833112
plaggieM=819832340.96830.90220.7660.95460.828615
sherlockN=4, Z=261421960.96620.84990.8040.94470.826317
simjavar=16151201520.97280.8760.8480.97110.86185
simtextr=414384220.9540.93830.5780.80750.715325
Compression tools
7zncd-BZip2mx=1,3,545642440.96920.9220.7560.95570.830814
7zncd-Deflatemx=7381222150.96630.86550.7850.94540.823320
7zncd-Deflate64mx=7,9381232150.96620.86450.7850.94530.822921
7zncd-LZMAmx=7,9411152130.96720.87250.7870.94830.827516
7zncd-LZMA2mx=7,9411182130.96690.86960.7870.94820.826218
7zncd-PPMdmx=9421401980.96620.85140.8020.94670.82619
bzip2ncdC=1..938622160.97220.92670.7840.96350.84947
gzipncdC=7311102030.96870.87870.7970.95560.835911
icdma=LZMA250863560.95580.88220.6440.92650.744524
 mx=7,9         
ncd-zlibN/A301042070.96890.88410.7930.95840.836110
ncd-bzlibN/A37822060.97120.90640.7940.96360.84658
xzncd-e391202030.96770.86910.7970.95160.831513
Other similarity analysers
bsdiff*N/A711995770.92240.68010.4230.85620.521630
% diff (L)*N/A1881080.11820.10120.9920.46750.1837 
% diff (T)*N/A265281340.33380.11710.8660.43530.2063 
diff (C)*N/A86261840.9190.56590.8160.93640.668326
py-difflibwhitespace=false28122320.97560.98460.7680.94120.86294
 autojunk=false         
py-fuzzywuzzytoken_set_ratio85581760.97660.93420.8240.97720.87572
py-jellyfishjaro_distance783404780.91820.60560.5220.86190.560729
py-ngramN/A491102240.96660.87580.7760.9410.822922
py-sklearnN/A482924580.9250.64990.5420.91130.591128
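For reference, the pair-based columns in the table above can be reproduced from the FP and FN counts. From the reported figures for the OCD data set we infer 10,000 file pairs of which 1,000 are true positive pairs; treat these totals as an assumption of this sketch.

```python
def pair_based_measures(fp: int, fn: int, positives: int = 1000, total: int = 10000):
    """Derive accuracy, precision, recall and F-score from error counts."""
    tp = positives - fn
    tn = total - positives - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / total
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# ccfx row: FP = 24, FN = 24 reproduces Acc 0.9952, Prec 0.976, Rec 0.976, F-score 0.976.
print(pair_based_measures(24, 24))
```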

Boiler-plate code (SOCO data set)

The table below shows the performance of all the tools and their optimal configurations. Using the F-score, jplag-text is the winner with the highest value of 0.9692.

ToolSettingsThresholdFPFNAccPrecRecAUCF-scoreRanking
Clone detectors
ccfx (C)*b=15,16,17
t=12
2542150.99920.91250.96690.99050.93897
deckard (T)*mintoken=501927170.99930.94170.96250.98230.95205
stride=2
similarity=1.00
iclones (L)*minblock=401920570.99890.95190.87420.94690.911412
minclone=50
nicad (L)*UPI=0.302219510.99900.95490.88740.96940.91999
minline=5
rename=consistent
abstract=condition
simian (L)*threshold=42620170.99940.95610.96250.99210.95933
ignoreVariableNames
Plag. det.
jplag-javat=122926130.99940.94420.97130.9895 0.95764
jplag-textt=93216120.99960.96500.97350.99390.96921
plaggieM=143336370.99890.92040.91830.97530.919310
sherlockN=5, Z=02222540.99890.94770.88080.99960.913011
simjavar=254618110.99960.96070.97570.99870.96822
simtextr=121773190.99860.85600.95810.98870.904213
Compression
7zncd-BZip2mx=1,3,564241180.99790.93310.73950.99010.825126
7zncd-Deflatemx=76427970.99820.92950.78590.99370.851724
7zncd-Deflate64mx=76427960.99820.92970.78810.99570.853023
7zncd-LZMAmx=7,96911990.99840.96990.78150.99400.865520
7zncd-LZMA2mx=7,96911990.99840.96990.78150.99390.865520
7zncd-PPMdmx=968191060.99810.94810.76600.99480.847425
bzip2ncdC=1,2,3,..,8,95420940.99830.94730.79250.99440.863022
gzipncdC=95425820.99840.93690.81900.99610.874016
icdma=LZMA
mx=1,3
84121510.99760.96180.66670.97360.787527
ncd-zlibN/A5710910.99850.97310.79910.99830.877614
ncd-bzlibN/A5230820.99830.92520.81900.99430.868918
xzncd2,36413940.99840.96510.79250.99420.870317
6,7,8,9,e65
Others
bsdiffN/A9021252120.96520.10190.53200.91610.171029
diff (C)N/A29774550.88450.05470.98900.91800.103630
difflibautojunk=true4230210.99920.93510.95360.99990.94436
whitespace=true
fuzzywuzzyratio6530300.99910.93380.93380.99890.93388
jellyfishjaro_distance8201620.99761.00000.64240.95550.782328
ngramN/A5920840.99840.94860.81460.99670.876515
cosineN/A6850680.99820.88510.84990.99730.867119

Area Under the ROC Curve (AUC)

OCD data set: The ROC curves with their respective AUCs are displayed below. In terms of the overall performance across the whole range of T, ccfx is the best with the highest AUC of 0.9995.

SOCO data set: The ROC curves with their respective AUCs are displayed below. In terms of the overall performance across the whole range of T, difflib is the best with the highest AUC of 0.9999.
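The AUC summarises a tool's ranking quality over the whole range of similarity thresholds T, so it is computed from the raw similarity scores rather than a single cut-off. A minimal sketch using scikit-learn; the lists are illustrative placeholders, not data from the study.

```python
from sklearn.metrics import roc_auc_score

# ground_truth: 1 if the pair is a true clone/reuse pair, 0 otherwise
# scores: the similarity value a tool reports for the same pair
ground_truth = [1, 1, 0, 0, 1, 0]
scores = [92, 75, 40, 65, 88, 10]

auc = roc_auc_score(ground_truth, scores)  # 1.0 = perfect ranking, 0.5 = random
print(auc)
```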

RQ2

Optimal Configurations

What are the best parameter settings and similarity thresholds for the techniques?

The optimal configurations of the tools can be found in the table above (second and third columns). Specifically, we inspected ccfx’s configurations and found that the tool performs best around b=19, t={7,9} and b=5, t={11,12}, as depicted in the figure below.

From the scatter plot below, we can see that the default settings of ccfx, b=50, t=12 (denoted by a red X symbol), provide decent precision but very low recall. We observed that one cannot tune ccfx to obtain the highest precision without sacrificing recall.

RQ3

Normalisation by Decompilation

How much does compilation followed by decompilation as a pre-processing normalisation method improve detection results of pervasively modified code?

The results after adding compilation and decompilation for normalisation to the post-processing step (Step 3 in the framework) before performing similarity detection are shown in the figure below. We can clearly observe that decompilation by both Krakatau and Procyon boosts the F-scores of every tool in the study.

The complete results can be found in the two tables below:

Decompiled by Krakatau

ToolSettingsThresholdFPFNAccPrecRecAUCF-scoreRanking
Clone detectors
ccfx (C)*b=5, t=113624240.99520.9760.9760.99950.9761
deckard (T)*mintoken=3017442270.97290.94610.7730.95850.85096
 stride=2         
 similarity=0.95         
iclones (L)*minblock=100363580.91960.90480.48860.70880.634527
 minclone=50         
nicad (L)*UPI=0.5038383460.96160.94510.6540.81640.77323
 minline=8         
 rename=blind         
 abstract=literal         
simian (C)*threshold=451501650.96850.84770.8350.92620.84139
 ignoreVariableNames         
Plagiarism detectors
jplag-javat=719581960.97460.93270.8040.95630.86363
jplag-textt=414662390.96950.92020.7610.96580.833112
plaggieM=819832340.96830.90220.7660.95460.828615
sherlockN=4, Z=261421960.96620.84990.8040.94470.826317
simjavar=16151201520.97280.8760.8480.97110.86185
simtextr=414384220.9540.93830.5780.80750.715325
Compression tools
7zncd-BZip2mx=1,3,545642440.96920.9220.7560.95570.830814
7zncd-Deflatemx=7381222150.96630.86550.7850.94540.823320
7zncd-Deflate64mx=7,9381232150.96620.86450.7850.94530.822921
7zncd-LZMAmx=7,9411152130.96720.87250.7870.94830.827516
7zncd-LZMA2mx=7,9411182130.96690.86960.7870.94820.826218
7zncd-PPMdmx=9421401980.96620.85140.8020.94670.82619
bzip2ncdC=1..938622160.97220.92670.7840.96350.84947
gzipncdC=7311102030.96870.87870.7970.95560.835911
icdma=LZMA250863560.95580.88220.6440.92650.744524
 mx=7,9         
ncd-zlibN/A301042070.96890.88410.7930.95840.836110
ncd-bzlibN/A37822060.97120.90640.7940.96360.84658
xzncd-e391202030.96770.86910.7970.95160.831513
Other similarity analysers
bsdiff*N/A711995770.92240.68010.4230.85620.521630
% diff (L)*N/A1881080.11820.10120.9920.46750.1837 
% diff (T)*N/A265281340.33380.11710.8660.43530.2063 
diff (C)*N/A86261840.9190.56590.8160.93640.668326
py-difflibwhitespace=false28122320.97560.98460.7680.94120.86294
 autojunk=false         
py-fuzzywuzzytoken_set_ratio85581760.97660.93420.8240.97720.87572
py-jellyfishjaro_distance783404780.91820.60560.5220.86190.560729
py-ngramN/A491102240.96660.87580.7760.9410.822922
py-sklearnN/A482924580.9250.64990.5420.91130.591128

The complete list of the optimal configurations

The complete list of optimal configurations for Krakatau is shown below.

Tool | Settings | Granularity | Threshold
ccfx | b=5, t=8 | T | 50
ccfx | b=5, t=11 | T | 17
deckard | mintoken=30, stride=1, similarity=0.95 | L, T, C | 29, 29, 34
deckard | mintoken=30, stride=1, similarity=1.00 | L, T, C | 10, 12, 15
deckard | mintoken=30, stride=2, similarity=0.90 | T | 54
deckard | mintoken=30, stride=2, similarity=0.95 | L, T, C | 22, 28, 32
deckard | mintoken=30, stride=inf, similarity=0.95 | L, T, C | 29, 29, 34
deckard | mintoken=30, stride=inf, similarity=1.00 | L, T, C | 10, 12, 15
deckard | mintoken=50, stride=1, similarity=0.95 | L, T, C | 23, 29, 31
deckard | mintoken=50, stride=2, similarity=0.95 | L, T, C | 21, 18, 23
deckard | mintoken=50, stride=inf, similarity=0.95 | L, T, C | 23, 28, 31
simian | threshold={3,4}, ignoreidentifiers | T | 17
simian | threshold=3, ignoreidentifiers | C |
simian | threshold=5, ignoreidentifiers | L | 12
simian | threshold=5, ignoreidentifiers | T | 13
simjava | r=18,19 | | 17
simjava | r=20 | | 16
simjava | r=26,27 | | 11
simjava | r=28 | | 9
simjava | r=default | | 27
simtext | r=4 | | 33
simtext | r=5 | | 31
icd | ma=LZMA, mx={7,9} | | 54
icd | ma=LZMA2, mx={7,9} | |

Decompiled by Procyon

ToolSettingsThresholdFPFNAccPrecRecAUCF-scoreRanking
Clone detectors
ccfx* (L)b=20 t=1..7114380.99580.99590.9620.9970.97864
deckard* (T)mintoken=30100320.996810.9680.99780.98372
 stride=1 inf         
 similarity=1.00         
iclones* (C)minblock=10018980.98840.98040.9020.95080.939611
 minclone=50         
nicad* (W)UPI=0.3011161000.98840.98250.90.95360.939412
 minline=10         
 rename=blind         
 abstract=condition, literal         
simian* (C)threshold=3238700.99220.99150.930.99870.95988
 ignoreIdentifiers         
Plagiarism detectors
jplag-javat=8220720.992810.9280.98870.96277
jplag-textt=91116480.99360.98350.9520.99820.96756
plaggieM=13 141016800.99040.98290.920.97730.95049
sherlockN=1 Z=05528160.99560.97230.9840.99970.97815
simjavar=default11800.99920.992110.99990.9961
simtextr=415421000.98580.95540.90.96860.926914
 r=default0        
Compression tools
7zncd-BZip2mx=1 3 551301160.98540.96720.8840.99090.923716
7zncd-Deflatemx=949251540.98210.97130.8460.98270.904320
7zncd-Deflate64mx=949251540.98210.97130.8460.98270.904320
7zncd-LZMAmx=7 952161640.9820.98120.8360.98430.902823
7zncd-LZMA2mx=7 952171640.98190.98010.8360.98410.902324
7zncd-PPMdmx=953221220.98560.97560.8780.98610.924215
bzip2ncdC=1..9 default47121400.98480.98620.860.99220.918818
gzipncdC=336401330.98270.95590.8670.98460.909325
icdma=LZMA, mx=7 954371500.98130.95830.850.97210.9009 
 ma=LZMA2, mx=7 9         
ncd-zlibN/A41301580.98120.96560.8420.98760.899626
ncd-bzlibN/A4781400.98520.99080.860.99230.920817
xzncd-e49351480.98170.96050.8520.9860.90322
Other similarity analysers
bsdiffN/A73482360.97160.94090.7640.96060.843329
diff (C)N/A2362440.9750.99210.7560.98260.858128
py-difflibautojunk=true2612940.98940.98690.9060.97880.944710
py-fuzzywuzzytoken_set_ratio900360.996410.9640.99920.98173
py-jellyfishjaro_winkler87842700.96460.89680.730.92180.804930
py-ngramN/A5881920.980.99020.8080.97140.889927
py-sklearnN/A6954740.98720.94490.9260.98970.935412

Statistical Tests

To strengthen the findings, we performed a statistical test to see whether the performance before and after normalisation via decompilation differs with statistical significance. We chose the non-parametric two-tailed Wilcoxon signed-rank test (we also tried the randomisation, i.e. permutation, test on the results and found identical test results in all cases) and performed the test with a confidence level of 95% (i.e. alpha <= 0.05). The table below shows that the observed F-scores before and after decompilation differ with statistical significance for both Krakatau and Procyon. We complement the statistical test with a non-parametric effect size measure, Vargha and Delaney’s A measure, to quantify the level of difference between two populations. Vargha and Delaney define three levels of difference, small, medium, and large, for A values over 0.56, 0.64, and 0.71, respectively. Using this scale, our F-score differences between before and after decompilation by Krakatau (A=0.031) and Procyon (A=0.063) are small.

Test | p-value | Significant? | Vargha-Delaney's A
Before vs. after decompilation by Krakatau | 1.863e-09 | Yes | 0.031
Before vs. after decompilation by Procyon | 1.863e-09 | Yes | 0.063
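A sketch of the two statistics, using SciPy for the Wilcoxon signed-rank test and a direct implementation of Vargha and Delaney's A; the two F-score lists are placeholders, not the study's data.

```python
from scipy.stats import wilcoxon

def vargha_delaney_a(xs: list, ys: list) -> float:
    """A = P(X > Y) + 0.5 * P(X == Y); 0.5 means no difference between the samples."""
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))

f_before = [0.85, 0.80, 0.78, 0.90]  # placeholder F-scores before decompilation
f_after = [0.95, 0.93, 0.91, 0.97]   # placeholder F-scores after decompilation

stat, p = wilcoxon(f_before, f_after)      # paired, two-tailed by default
a12 = vargha_delaney_a(f_before, f_after)  # values far below 0.5 mean 'after' tends to be higher
print(p, a12)
```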


RQ4

Reuse of Configurations

Can we reuse optimal configurations from one data set in another data set effectively?

For the 10 highest-ranking tools from RQ1, we applied the optimal configurations derived from the generated data set (denoted as C*generated) to the SOCO data set. The table below shows that using these configurations has a detrimental impact on the similarity detection results for another data set, even for tools that have no parameters (e.g. ncd-zlib and ncd-bzlib) and are only influenced by the chosen similarity threshold.

Tool | OCD settings | T | OCD F-score | SOCO F-score (OCD config) | SOCO settings | T | SOCO F-score
ccfx (C) | b=5, t=11 | 36 | 0.976 | 0.8441 | b=15,16,17, t=12 | 25 | 0.9389
py-fuzzywuzzy | token_set_ratio | 85 | 0.8757 | 0.6012 | ratio | 65 | 0.9338
jplag-java | t=7 | 19 | 0.8636 | 0.3168 | t=12 | 29 | 0.9576
py-difflib | autojunk=false, whitespace=false | 28 | 0.8629 | 0.2113 | autojunk=true, whitespace=true | 42 | 0.9443
simjava | r=16 | 15 | 0.8618 | 0.5888 | r=25 | 23 | 0.9259
deckard (T) | mintoken=30, stride=2, similarity=0.95 | 17 | 0.8509 | 0.3305 | mintoken=50, stride=1, similarity=1.0 | 19 | 0.952
bzip2ncd | C=1..9 | 38 | 0.8494 | 0.3661 | C=1..9 | 54 | 0.863
ncd-bzlib | N/A | 37 | 0.8465 | 0.3357 | N/A | 52 | 0.8689
simian (C) | threshold=4, ignoreVariableNames | 5 | 0.8413 | 0.6394 | threshold=4, ignoreVariableNames | 26 | 0.9593
ncd-zlib | N/A | 30 | 0.8361 | 0.3454 | N/A | 57 | 0.8776

RQ5

Ranked Results

Which tools perform best when only the top n results are retrieved?

We applied three additional error measures adopted from information retrieval, precision-at-n (prec@n), average r-precision (ARP), and mean average precision (MAP), to the OCD and SOCO data sets. The results are shown below. The complete settings can be found in the CSV files below.

Ranked results: generated data set

Ranking | F-score (pair-based) | prec@n (pair-based) | ARP (query-based) | MAP (query-based)
1 | (0.976) ccfx | (0.976) ccfx | (1.000) ccfx | (1.000) ccfx
2 | (0.876) fuzzywuzzy | (0.860) simjava | (0.915) fuzzywuzzy | (0.949) fuzzywuzzy
3 | (0.864) jplag-java | (0.858) fuzzywuzzy | (0.913) ncd-bzlib | (0.943) ncd-bzlib
4 | (0.863) difflib | (0.842) simian | (0.912) 7zncd-BZip2 | (0.942) bzip2ncd
5 | (0.862) simjava | (0.836) deckard | (0.909) bzip2ncd | (0.938) 7zncd-BZip2
6 | (0.851) deckard | (0.836) jplag-java | (0.900) 7zncd-PPMd | (0.937) gzipncd
7 | (0.849) bzip2ncd | (0.832) bzip2ncd | (0.900) gzipncd | (0.935) ncd-zlib
8 | (0.847) ncd-bzlib | (0.828) difflib | (0.898) ncd-zlib | (0.933) jplag-text
9 | (0.841) simian | (0.826) ncd-bzlib | (0.898) xzncd | (0.930) 7zncd-PPMd
10 | (0.836) ncd-zlib | (0.820) 7zncd-BZip2 | (0.895) 7zncd-LZMA2 | (0.929) xzncd

Ranked results: SOCO data set

Ranking | F-score (pair-based) | prec@n (pair-based) | ARP (query-based) | MAP (query-based)
1 | (0.969) jplag-text | (0.965) jplag-text | (0.998) jplag-java | (0.997) jplag-java
2 | (0.968) simjava | (0.960) simjava | (0.998) difflib | (0.997) difflib
3 | (0.959) simian | (0.956) simian | (0.989) ccfx | (0.993) jplag-text
4 | (0.958) jplag-java | (0.947) deckard | (0.989) simjava | (0.988) simjava
5 | (0.952) deckard | (0.943) jplag-java | (0.987) gzipncd | (0.987) gzipncd
6 | (0.944) difflib | (0.938) difflib | (0.986) jplag-text | (0.987) ncd-zlib
7 | (0.939) ccfx | (0.929) ccfx | (0.985) ncd-zlib | (0.986) sherlock
8 | (0.934) fuzzywuzzy | (0.929) fuzzywuzzy | (0.984) 7zncd-Deflate | (0.986) 7zncd-Deflate64
9 | (0.920) nicad | (0.914) plaggie | (0.984) 7zncd-Deflate64 | (0.986) 7zncd-Deflate
10 | (0.919) plaggie | (0.901) nicad | (0.983) 7zncd-LZMA | (0.984) fuzzywuzzy

We also varied the value of n for precision-at-n. Plots of the tools’ performances for multiple precision-at-n values are shown below.

Multiple prec-at-n values for OCD data set

Multiple prec-at-n values for SOCO data set

Statistical Tests

Since ARP is computed from a mean, we performed a statistical test to strengthen our results by testing for statistically significant differences in the sets of r-precision values between tools. We chose a one-tailed non-parametric randomisation test (i.e. permutation test) due to its robustness for significance testing in information retrieval, as shown by Smucker et al. (2007) (we also tried a one-tailed Wilcoxon signed-rank test on the results and found identical test results in all cases). We performed the test using 100,000 random samples with a confidence level of 95% (i.e. alpha <= 0.05). The statistical test results show that, for the generated data set, ccfx is the only tool that dominates the other tools’ r-precision values with statistical significance. For the SOCO data set, jplag-java and difflib outperform 7zncd-Deflate, 7zncd-Deflate64, and 7zncd-LZMA with statistical significance.

Similarly, since MAP is also computed from a mean, we performed a one-tailed non-parametric randomisation test on pairwise comparisons of the tools’ MAP values. The test results show that, for the generated data set, ccfx dominates the other tools’ MAPs with statistical significance. For the SOCO data set, we found that jplag-java and difflib outperform gzipncd, ncd-zlib, sherlock, 7zncd-Deflate64, 7zncd-Deflate, and fuzzywuzzy with statistical significance.
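A sketch of a paired randomisation (permutation) test of the kind described above, assuming two equal-length lists of per-query values for tools A and B; the names and the sample count default are illustrative.

```python
import random

def paired_permutation_test(a: list, b: list, samples: int = 100_000, seed: int = 0) -> float:
    """One-tailed paired permutation test that tool A's values exceed tool B's on average."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    at_least_as_extreme = 0
    for _ in range(samples):
        # randomly flip the sign of each paired difference (the null hypothesis of no difference)
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / samples  # p-value; reject the null if below alpha = 0.05
```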

RQ6

Local + Global Modifications

How well do the techniques perform when source code containing boiler-plate code clones has been pervasively modified?

We evaluate the tools on a data set combining both local and global code modifications. This question also studies which types of pervasive modifications (source code obfuscation, bytecode obfuscation, compilation/decompilation) most strongly affect the tools’ performances.

Tools’ Performances vs. Individual Pervasive Modification Type

Using the results from experimental Scenario 5, we present the tools’ performances based on F-scores in the table below and show the distribution of F-scores in a boxplot. The F-scores are grouped according to the 10 pervasive code modification types. Numbers are highlighted when the F-score is higher than 0.8.

Tool | O | A | K | Pc | PgK | PgPc | AK | APc | APgK | APgPc
Clone det.
ccfx (C)* | 0.8911 | 0.3714 | 0.0000 | 0.6265 | 0.0000 | 0.1034 | 0.0000 | 0.2985 | 0.0000 | 0.1034
deckard (T)* | 0.9636 | 0.9217 | 0.1667 | 0.3333 | 0.0357 | 0.2286 | 0.1667 | 0.3252 | 0.0357 | 0.2286
iclones (L)* | 0.5000 | 0.0000 | 0.0000 | 0.0357 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
nicad (T)* | 0.5823 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
simian (L)* | 0.8350 | 0.1034 | 0.0357 | 0.1356 | 0.0000 | 0.0357 | 0.0000 | 0.0357 | 0.0000 | 0.0357
Plagiarism det.
jplag-java | 1.0000 | 1.0000 | 0.7429 | 0.9524 | 0.2973 | 0.4533 | 0.7547 | 0.9720 | 0.2973 | 0.4507
jplag-text | 0.9815 | 0.6265 | 0.5581 | 0.6304 | 0.3590 | 0.4250 | 0.4906 | 0.5581 | 0.3590 | 0.4304
plaggie | 0.9636 | 0.9159 | 0.7363 | 0.9372 | 0.2171 | 0.4626 | 0.7363 | 0.9423 | 0.2171 | 0.4626
sherlock | 0.9483 | 0.8298 | 0.7872 | 0.8298 | 0.3061 | 0.3516 | 0.6744 | 0.7826 | 0.3061 | 0.3516
simjava | 0.9649 | 0.9815 | 1.0000 | 0.7525 | 0.3188 | 0.3913 | 0.8041 | 0.7525 | 0.3188 | 0.3913
simtext | 0.9649 | 0.7191 | 0.1667 | 0.4932 | 0.0357 | 0.1667 | 0.0702 | 0.2258 | 0.0357 | 0.1667
Compression
7zncd-BZip2 | 0.9273 | 0.7736 | 0.6852 | 0.8649 | 0.2446 | 0.3704 | 0.6423 | 0.7465 | 0.2446 | 0.3704
7zncd-Deflate | 0.9483 | 0.7579 | 0.6935 | 0.8406 | 0.2427 | 0.3333 | 0.6360 | 0.7418 | 0.2427 | 0.3333
7zncd-Deflate64 | 0.9483 | 0.7579 | 0.6935 | 0.8406 | 0.2427 | 0.3333 | 0.6360 | 0.7373 | 0.2427 | 0.3333
7zncd-LZMA | 0.9649 | 0.7967 | 0.7488 | 0.8851 | 0.2663 | 0.3842 | 0.6768 | 0.7665 | 0.2632 | 0.3842
7zncd-LZMA2 | 0.9649 | 0.7934 | 0.7536 | 0.8851 | 0.2718 | 0.3923 | 0.6700 | 0.7632 | 0.2697 | 0.4000
7zncd-PPMd | 0.9623 | 0.7965 | 0.7628 | 0.8909 | 0.2581 | 0.3796 | 0.6667 | 0.8019 | 0.2581 | 0.3796
bzip2ncd | 0.9649 | 0.8305 | 0.8302 | 0.9273 | 0.3590 | 0.4681 | 0.7612 | 0.8448 | 0.3562 | 0.4681
gzipncd | 0.9623 | 0.7965 | 0.7628 | 0.8909 | 0.2581 | 0.3796 | 0.6667 | 0.8019 | 0.2581 | 0.3796
icd | 0.9216 | 0.5058 | 0.4371 | 0.5623 | 0.2237 | 0.2822 | 0.3478 | 0.4239 | 0.2237 | 0.2822
ncd-zlib | 0.9821 | 0.8571 | 0.8246 | 0.9432 | 0.4021 | 0.4920 | 0.7491 | 0.8559 | 0.3963 | 0.4920
ncd-bzlib | 0.9649 | 0.8269 | 0.8269 | 0.9273 | 0.3529 | 0.4634 | 0.7500 | 0.8448 | 0.3500 | 0.4719
xzncd | 0.9734 | 0.8416 | 0.7925 | 0.9198 | 0.3133 | 0.4615 | 0.7035 | 0.8148 | 0.3133 | 0.4615
Others
bsdiff | 0.4388 | 0.2280 | 0.1529 | 0.2005 | 0.1151 | 0.1350 | 0.1276 | 0.1596 | 0.1152 | 0.1353
diff (C) | 0.2835 | 0.2374 | 0.1585 | 0.2000 | 0.1296 | 0.1248 | 0.1530 | 0.1786 | 0.1302 | 0.1249
difflib | 0.9821 | 0.9550 | 0.8952 | 0.9565 | 0.4790 | 0.5087 | 0.8688 | 0.9381 | 0.4606 | 0.5091
fuzzywuzzy | 1.0000 | 0.9821 | 0.9259 | 0.9636 | 0.4651 | 0.5116 | 0.9074 | 0.9541 | 0.4557 | 0.5116
jellyfish | 0.9273 | 0.7253 | 0.6400 | 0.6667 | 0.2479 | 0.3579 | 0.5513 | 0.5000 | 0.2479 | 0.3662
ngram | 1.0000 | 0.9464 | 0.8952 | 0.9346 | 0.4110 | 0.4490 | 0.8785 | 0.8908 | 0.4054 | 0.4578
cosine | 0.9074 | 0.6847 | 0.7123 | 0.6800 | 0.3500 | 0.3596 | 0.5823 | 0.5287 | 0.3500 | 0.3596

Raw data download

Raw data of the results from the four experimental scenarios can be downloaded below:

Results from the generated, SOCO, and SOCOgen data sets