Toxic code snippets on Stack Overflow
Ragkhitwetsagul C., Krinke J., Paixao M., Bianco G., Oliveto R.
Our empirical study of online code clones between 72,365 Java code snippets on Stack Overflow and 111 Java open source projects reveals toxic code snippets: 100 outdated and 214 potentially license-violating clone pairs.
Online Code Clones
We refer to code snippets that are copied from software systems to online Q&A websites (such as Stack Overflow), and vice versa, as online code clones. Online code clones are created in two directions: (1) code is cloned from a software project to a Q&A website as an example; or (2) code is cloned from a Q&A website to a software project to obtain a functionality, perform a particular task, or fix a bug.
Toxic Code Snippets
Toxic code snippets are code snippets that are harmful to reuse and that, in several cases, result from online code cloning. We found that Stack Overflow code snippets originating from open source software or online sources can become toxic when they are (1) outdated or (2) violating their original software license.
Outdated code
Outdated code occurs when a piece of code has been copied from its origin to another location and the original is later updated (Xia et al., 2014). Usually, code clone detection is used to locate clone instances and update them to match the originals. However, online code clones are more difficult to detect than clones in regular software projects because of the large search space and the mix of natural and programming languages in the same post.
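To make the idea of an outdated clone concrete, the following is a minimal sketch of a naive line-based check that reports lines present in the current original but absent from a copied snippet. It is not the clone detection approach used in the study; the class name OutdatedCloneCheck, the methods normalize and missingFromClone, and the normalization rules are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OutdatedCloneCheck {

    // Naive normalization (an assumption for this sketch): drop trailing
    // line comments and remove all whitespace so that formatting
    // differences do not count as changes. String literals that contain
    // "//" are not handled.
    static String normalize(String line) {
        int comment = line.indexOf("//");
        if (comment >= 0) {
            line = line.substring(0, comment);
        }
        return line.replaceAll("\\s+", "");
    }

    // Returns the normalized, non-empty lines of the original that do
    // not occur anywhere in the clone.
    static List<String> missingFromClone(String clone, String original) {
        Set<String> cloneLines = new HashSet<>();
        for (String line : clone.split("\n")) {
            cloneLines.add(normalize(line));
        }
        List<String> missing = new ArrayList<>();
        for (String line : original.split("\n")) {
            String normalized = normalize(line);
            if (!normalized.isEmpty() && !cloneLines.contains(normalized)) {
                missing.add(normalized);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String clone = String.join("\n",
            "buffer.reset(b1, s1, l1); // parse key1",
            "key1.readFields(buffer);",
            "buffer.reset(b2, s2, l2); // parse key2",
            "key2.readFields(buffer);");
        String original = clone + "\n"
            + "buffer.reset(null, 0, 0); // clean up reference";
        // Prints the single line that the clone is missing.
        System.out.println(missingFromClone(clone, original));
    }
}
```

A check like this only works once the clone and its origin are already paired; finding that pairing across many posts and projects is the hard part that real clone detectors address.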
Licensing violation
Code cloning can also have side effects on software license compatibility. Carelessly cloning code from one project to another project with a different license may cause a software licensing violation (German et al., 2009). This also happens within the context of online Q&A websites such as Stack Overflow.
Examples of Toxic Code Snippets
1. Hadoop's compare method
The first example is an outdated and license-violating online code clone in an answer to a Stack Overflow question about how to implement RawComparator in Hadoop.
The figure below shows, on the left, a code snippet embedded as part of the accepted answer. The snippet shows how Hadoop implements the compare method in its WritableComparator class. The code snippet on the right shows another version of the same method, this time extracted from the latest version (as of October 3, 2017) of Hadoop.
The two versions are highly similar except for one line, buffer.reset(null,0,0);, which was added on November 21, 2014. The added line is intended to clean up the reference held in the buffer variable and avoid excess heap usage (issue no. HADOOP-11323).
/* Code in Stack Overflow post ID 22315734 (no license) */
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  try {
    buffer.reset(b1, s1, l1); // parse key1
    key1.readFields(buffer);
    buffer.reset(b2, s2, l2); // parse key2
    key2.readFields(buffer);
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  return compare(key1, key2); // compare them
}
/* WritableComparator.java (2014-11-21) (Apache v.2.0 license) */
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  try {
    buffer.reset(b1, s1, l1); // parse key1
    key1.readFields(buffer);
    buffer.reset(b2, s2, l2); // parse key2
    key2.readFields(buffer);
    buffer.reset(null, 0, 0); // clean up reference
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  return compare(key1, key2); // compare them
}
Although this change was introduced into the compare method several years ago, the code example in the Stack Overflow post remains unchanged.
In addition, the original WritableComparator class in Hadoop is distributed under the Apache license version 2.0, while its cloned instance on Stack Overflow contains only the compare method and omits the license statement at the top of the file.
This raises two potential issues. First, the code snippet may appear to be under Stack Overflow's CC BY-SA 3.0 license instead of its original Apache license. Second, if the code snippet is copied and incorporated into another software project with a conflicting license, a legal issue may arise.
2. Hadoop's humanReadableInt method
The second motivating example, a toxic code snippet with more disruptive changes than the first, can be found in an answer to a Stack Overflow question about how to format file sizes in a human-readable form.
The figure below shows, on the left, a code snippet that performs the task, taken from the StringUtils class in Hadoop. The code snippet on the right shows another version of the same method, this time extracted from the latest version of Hadoop. The two versions are totally different: the humanReadableInt method was rewritten on February 5, 2013 to fix a race condition (issue no. HADOOP-9252).
As in the first example, the cloned code snippet on Stack Overflow does not include its original Apache v.2.0 license.
/* Code in Stack Overflow post ID 801987 (no license) */
public static String humanReadableInt(long number) {
  long absNumber = Math.abs(number);
  double result = number;
  String suffix = "";
  if (absNumber < 1024) {
    // nothing
  } else if (absNumber < 1024 * 1024) {
    result = number / 1024.0;
    suffix = "k";
  } else if (absNumber < 1024 * 1024 * 1024) {
    result = number / (1024.0 * 1024);
    suffix = "m";
  } else {
    result = number / (1024.0 * 1024 * 1024);
    suffix = "g";
  }
  return oneDecimal.format(result) + suffix;
}
/* StringUtils.java (2013-02-05) (Apache v.2.0 license) */
public static String humanReadableInt(long number) {
  return TraditionalBinaryPrefix.long2String(number, "", 1);
}
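The Stack Overflow version references a field named oneDecimal that the snippet never declares. One plausible source of a race in code of this shape, assumed here rather than taken from the issue report, is that the field is a java.text.DecimalFormat shared across threads; the Java documentation notes that decimal formats are generally not synchronized. The sketch below is a self-contained variant of the old logic that creates a formatter on every call so that no state is shared. The class name HumanReadable, the "0.0" pattern, and the use of Locale.ROOT are assumptions for illustration; this is not Hadoop's actual fix, which replaced the method body as shown above.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class HumanReadable {

    // Same branching logic as the Stack Overflow snippet, but with a
    // formatter created per call instead of a shared field.
    static String humanReadableInt(long number) {
        // Assumption: one fraction digit, locale-independent symbols.
        DecimalFormat oneDecimal = new DecimalFormat(
            "0.0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        long absNumber = Math.abs(number);
        double result = number;
        String suffix = "";
        if (absNumber < 1024) {
            // small values keep no suffix
        } else if (absNumber < 1024 * 1024) {
            result = number / 1024.0;
            suffix = "k";
        } else if (absNumber < 1024 * 1024 * 1024) {
            result = number / (1024.0 * 1024);
            suffix = "m";
        } else {
            result = number / (1024.0 * 1024 * 1024);
            suffix = "g";
        }
        return oneDecimal.format(result) + suffix;
    }

    public static void main(String[] args) {
        System.out.println(humanReadableInt(1536));    // 1.5k
        System.out.println(humanReadableInt(1048576)); // 1.0m
    }
}
```

Creating a formatter per call trades a small allocation cost for thread safety. Anyone who copied the old snippet along with a shared formatter field would not receive the upstream fix automatically, which is what makes the outdated clone harmful.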