Siamese (Scalable, incremental, and multi-representation) is a code clone search system built on Elasticsearch. It layers code clone detection techniques on top, including code normalisation, n-grams, and query reduction. It can search for Type-1 to Type-3/Type-4 clones in a large corpus of Java source code within seconds.
1. Download elasticsearch-2.2.0 and extract to disk.
mkdir ~/siamese
cd ~/siamese
wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.2.0/elasticsearch-2.2.0.tar.gz
tar -xvf elasticsearch-2.2.0.tar.gz
rm elasticsearch-2.2.0.tar.gz
2. Modify the configuration file in config/elasticsearch.yml
cd elasticsearch-2.2.0
vim config/elasticsearch.yml
Add the following lines at the end of the file. Save and quit.
cluster.name: stackoverflow
index.query.bool.max_clause_count: 4096
3. Clone the project from GitHub.
cd ~/siamese
git clone https://github.com/UCL-CREST/Siamese.git
4. Install JDK and Maven
sudo apt-get install default-jdk
sudo apt-get install maven
5. Check that you can call javac.
javac
If javac does not produce any output, your JAVA_HOME is not set. Set it by opening the file /etc/environment
vim /etc/environment
and pasting the location of JAVA_HOME at the end of the file. You can locate JAVA_HOME with
whereis javac
ls -l <the path>
Keep following the links until you reach the real path (not a symlink) to javac.
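The manual symlink chase above can also be done in one step. The following is a minimal Python sketch, not part of Siamese; the function names are hypothetical, and it assumes the usual JDK layout in which javac lives in `<JAVA_HOME>/bin/javac`.

```python
import os
import shutil


def java_home_from_javac(javac_path):
    """Resolve every symlink in javac_path and return the directory two levels up.

    Assumes the conventional layout <JAVA_HOME>/bin/javac.
    """
    real_path = os.path.realpath(javac_path)
    return os.path.dirname(os.path.dirname(real_path))


def find_java_home():
    """Return a candidate JAVA_HOME, or None if javac is not on the PATH."""
    javac = shutil.which("javac")
    if javac is None:
        return None
    return java_home_from_javac(javac)


if __name__ == "__main__":
    print(find_java_home() or "javac not found on PATH")
```

The printed directory is the value to put in /etc/environment, provided your JDK follows the assumed layout.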
5. Set the location of elasticsearch in config.properties.
cd Siamese
vim config.properties
Set elasticsearchLoc to your elasticsearch directory, then save and quit.
elasticsearchLoc=/my/dir/elasticsearch-2.2.0
6. Try starting the elasticsearch service
./elasticsearch-2.2.0/bin/elasticsearch
You should see an elasticsearch execution log like this.
[2018-10-02 03:50:35,305][INFO ][node ] [Warlock] version[2.2.0], pid[27101], build[8ff36d1/2016-01-27T13:32:39Z]
[2018-10-02 03:50:35,305][INFO ][node ] [Warlock] initializing ...
[2018-10-02 03:50:35,658][INFO ][plugins ] [Warlock] modules [lang-expression, lang-groovy], plugins [], sites []
[2018-10-02 03:50:35,674][INFO ][env ] [Warlock] using [1] data paths, mounts [[/ (/dev/sda2)]], net usable_space [107.8gb], net total_space [202.6gb], spins? [no], types [ext4]
[2018-10-02 03:50:35,674][INFO ][env ] [Warlock] heap size [989.8mb], compressed ordinary object pointers [true]
[2018-10-02 03:50:36,919][INFO ][node ] [Warlock] initialized
[2018-10-02 03:50:36,919][INFO ][node ] [Warlock] starting ...
[2018-10-02 03:50:36,982][INFO ][transport ] [Warlock] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2018-10-02 03:50:36,989][INFO ][discovery ] [Warlock] stackoverflow/VPfoqhukSoiP7RtKKgvYmg
[2018-10-02 03:50:40,037][INFO ][cluster.service ] [Warlock] new_master {Warlock}{VPfoqhukSoiP7RtKKgvYmg}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2018-10-02 03:50:40,063][INFO ][http ] [Warlock] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2018-10-02 03:50:40,064][INFO ][node ] [Warlock] started
[2018-10-02 03:50:40,101][INFO ][gateway ] [Warlock] recovered [0] indices into cluster_state
Then, kill the process (Ctrl+C) and start the elasticsearch engine as a background service (with the -d flag).
./elasticsearch-2.2.0/bin/elasticsearch -d
You can also test that elasticsearch is running in the background by issuing the command below.
curl -XGET 'localhost:9200/_cat/indices?v&pretty'
You should see output like this. Only the header row is printed, which means there is no index in elasticsearch yet.
health status index pri rep docs.count docs.deleted store.size pri.store.size
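The same check can be scripted. The sketch below is an illustration only: the function names are hypothetical, it assumes elasticsearch listens on localhost:9200 as in the log above, and it assumes every row of the `_cat/indices?v` output fills every column.

```python
import urllib.request


def parse_cat_indices(text):
    """Return the index names found in `_cat/indices?v` output.

    The first non-empty line is treated as the header row; the position of
    the "index" column in it is used to read the name from every other row.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    column = lines[0].split().index("index")
    return [line.split()[column] for line in lines[1:]]


def fetch_indices(host="localhost", port=9200):
    """Query a running elasticsearch node and return its index names."""
    url = f"http://{host}:{port}/_cat/indices?v"
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_cat_indices(response.read().decode("utf-8"))
```

On a fresh installation `fetch_indices()` should return an empty list; a connection error suggests that elasticsearch is not running.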
7. Create an executable jar and copy to the Siamese home directory
cd Siamese
mvn compile package
cp -i target/siamese-0.0.*.jar .
8. Try to execute Siamese.
java -jar siamese-0.0.6-SNAPSHOT.jar
9. You will see how to execute Siamese printed on the screen.
$ java -jar siamese-0.0.6-SNAPSHOT.jar
usage: (v 0.6) $java -jar siamese.jar -cf <config file> [-i input] [-o
output] [-c command] [-h help]
Example: java -jar siamese.jar -cf config.properties
Example: java -jar siamese.jar -cf config.properties -i /my/input/dir -o
/my/output/dir -c index
-c,--command <arg> [optional] command to execute [index, search].
This will override the configuration file.
-cf,--configFile <arg> [* requried *] a configuration file
-h,--help <optional> print help
-i,--inputFolder <arg> [optional] location of the input files (for
index or query). This will override the
configuration file.
-o,--outputFolder <arg> [optional] location of the search result file.
This will override the configuration file.
10. An example of running Siamese to index a project “foo”.
java -jar siamese-0.0.6-SNAPSHOT.jar -c index -i /my/dir/foo -cf config.properties
11. Then, tell Siamese to search for clones of “bar” in the index of “foo”.
java -jar siamese-0.0.6-SNAPSHOT.jar -c search -i /my/dir/bar -o /my/output/dir -cf config.properties
12. After Siamese finishes its execution, the output file (clone classes) will be located in /my/output/dir. The file name follows the pattern data_qr_<timestamp>.xml.
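If you script the index and search steps, it can help to build the command lines in one place and to pick up the newest result file afterwards. This is a minimal sketch with hypothetical helper names; it uses only the flags shown in the usage text and the data_qr_<timestamp>.xml pattern. Because the timestamp format is not described here, it picks the newest file by modification time rather than by name.

```python
import glob
import os


def siamese_command(jar, command, input_dir, config, output_dir=None):
    """Build the argument list for one Siamese run (command is index or search)."""
    argv = ["java", "-jar", jar, "-c", command, "-i", input_dir]
    if output_dir is not None:
        argv += ["-o", output_dir]
    argv += ["-cf", config]
    return argv


def latest_result(output_dir):
    """Return the most recently modified data_qr_*.xml file, or None if there is none."""
    matches = glob.glob(os.path.join(output_dir, "data_qr_*.xml"))
    if not matches:
        return None
    return max(matches, key=os.path.getmtime)
```

The returned list can be passed to `subprocess.run(argv, check=True)`, once for the index run and once for the search run.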
13. If you want to enforce a similarity threshold on the search results, modify the config.properties file to enable fuzzywuzzy or tokenratio (recommended) similarity. Choose a similarity threshold for each of the four code representations (r0, r1, r2, r3), in that order.
computeSimilarity : tokenratio
simThreshold : 50%,50%,50%,50%
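To make the order of the four values explicit, the sketch below reads the two settings and maps each threshold to its representation. It only illustrates the file format and is not how Siamese itself reads the file; the function names are hypothetical, and the parser handles only simple `key : value` and `key=value` lines.

```python
def parse_properties(text):
    """Parse simple `key : value` or `key=value` lines into a dict."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        # Split at whichever separator ('=' or ':') comes first.
        positions = [p for p in (line.find("="), line.find(":")) if p != -1]
        if not positions:
            continue
        cut = min(positions)
        props[line[:cut].strip()] = line[cut + 1:].strip()
    return props


def thresholds(props):
    """Map the four simThreshold percentages to r0, r1, r2, r3 in order."""
    parts = [part.strip() for part in props["simThreshold"].split(",")]
    if len(parts) != 4:
        raise ValueError("simThreshold needs four values, one per representation")
    values = [float(part.rstrip("%")) for part in parts]
    return dict(zip(("r0", "r1", "r2", "r3"), values))


if __name__ == "__main__":
    example = "computeSimilarity : tokenratio\nsimThreshold : 50%,50%,50%,50%\n"
    print(thresholds(parse_properties(example)))
```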
Siamese: The Siamese executable can be downloaded here: Siamese v. 0.6. Please make sure you have Java 8 installed on your machine.
1. To execute Siamese, unzip the file and follow the steps below:
$cd siamese
$./elasticsearch-2.2.0/bin/elasticsearch -d
$java -jar siamese-0.0.5-SNAPSHOT.jar
Then you’ll see the usage and example of how to use Siamese.
usage: (v 0.5) $java -jar siamese.jar -cf <config file> [-i input] [-o output] [-c command] [-h help]
Example: java -jar siamese.jar -cf config.properties
Example: java -jar siamese.jar -cf config.properties -i /my/input/dir -o /my/output/dir -c index
-c,--command <arg> [optional] command to execute [index, search].
This will override the configuration file.
-cf,--configFile <arg> [* requried *] a configuration file
-h,--help <optional> print help
-i,--inputFolder <arg> [optional] location of the input files (for
index or query). This will override the
configuration file.
-o,--outputFolder <arg> [optional] location of the search result file.
This will override the configuration file.
2. An example of running Siamese to index a project “foo”.
java -jar siamese-0.0.6-SNAPSHOT.jar -c index -i /my/dir/foo -cf config.properties
3. Then, tell Siamese to search for clones of “bar” in “foo”.
java -jar siamese-0.0.6-SNAPSHOT.jar -c search -i /my/dir/bar -o /my/output/dir -cf config.properties
4. After Siamese finishes its execution, the output file (clone classes) will be located in /my/output/dir. The file name follows the pattern data_qr_<timestamp>.xml.
5. If you want to enforce a similarity threshold on the search results, modify the config.properties file to enable fuzzywuzzy or tokenratio (recommended) similarity. Choose a similarity threshold for each of the four code representations (r0, r1, r2, r3), in that order.
computeSimilarity : tokenratio
simThreshold : 50%,50%,50%,50%
BigCloneEval: BigCloneEval is a tool for automated recall evaluation based on the BigCloneBench data set. It can be downloaded from: BigCloneBench
How to read the results: Each Siamese search result consists of three parts: (1) file path, (2) method name, (3) starting and ending line.
For example, a clone pair of 10_so/299495_0.java_paintComponent#22#26 and mattibal/meshnet/MeshNetBase/src/com/mattibal/meshnet/utils/color/gui/LabChooserJFrame.java_paintComponent#89#95 means that the method paintComponent in the file 10_so/299495_0.java from line 22 to line 26 is a clone of the method paintComponent in the file mattibal/meshnet/MeshNetBase/src/com/mattibal/meshnet/utils/color/gui/LabChooserJFrame.java from line 89 to line 95.
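If you post-process the results, the identifier can be split back into its parts. The sketch below is an assumption-laden illustration rather than part of Siamese: the names are hypothetical, and it assumes the file path ends in `.java` and is joined to the method name by an underscore, as in the example above.

```python
from typing import NamedTuple


class CloneFragment(NamedTuple):
    file_path: str
    method: str
    start_line: int
    end_line: int


def parse_fragment(identifier):
    """Split `<path>.java_<method>#<start>#<end>` into its four parts."""
    # The two line numbers are the last two '#'-separated fields.
    location, start, end = identifier.rsplit("#", 2)
    marker = ".java_"
    cut = location.rfind(marker)
    if cut == -1:
        raise ValueError(f"no '.java_' separator in {identifier!r}")
    file_path = location[:cut + len(".java")]
    method = location[cut + len(marker):]
    return CloneFragment(file_path, method, int(start), int(end))


if __name__ == "__main__":
    print(parse_fragment("10_so/299495_0.java_paintComponent#22#26"))
```

Splitting on the last `.java_` rather than on the last underscore keeps method names that contain underscores intact.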
If you have any questions or find any issues, please contact Chaiyong Ragkhitwetsagul at cragkhit [at] gmail [dot] com or Jens Krinke at j.krinke [at] ucl [dot] ac [dot] uk.