Toxic code snippets on Stack Overflow

Ragkhitwetsagul C., Krinke J., Paixao M., Bianco G., Oliveto R.

Our empirical study of online code clones between 72,365 Java code snippets on Stack Overflow and 111 Java open source projects reveals toxic code snippets: 100 outdated and 214 potentially license-violating clone pairs.


Online Code Clones

We define online code clones as code snippets that are copied from software systems to online Q&A websites (such as Stack Overflow) and vice versa. Online code clones are created in two directions: (1) code is cloned from a software project to a Q&A website as an example; or (2) code is cloned from a Q&A website to a software project to obtain a functionality, perform a particular task, or fix a bug.


Toxic Code Snippets

Toxic code snippets are code snippets that are harmful to reuse and that, in several cases, result from online code cloning. We found that Stack Overflow code snippets originating from open source software or online sources can become toxic when they are (1) outdated or (2) violating their original software license.

Outdated code

Outdated code occurs when a piece of code has been copied from its origin to another location and the original is later updated (Xia et al., 2014). Code clone detection is usually used to locate clone instances so that they can be updated to match the originals. However, online code clones are more difficult to detect than clones in regular software projects because of the large search space and the mix of natural and programming languages in the same post.
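
To make the idea of locating clone instances concrete, the following is a minimal sketch of a token-based similarity check between two snippets. It is an illustration only and not the detection tooling used in the study; the class name SnippetSimilarity, the tokenization rule, and the Jaccard measure are all assumptions chosen for brevity.

```java
import java.util.HashSet;
import java.util.Set;

public class SnippetSimilarity {

    // Split source text into identifier-like and numeric tokens.
    // Punctuation, operators, and whitespace are discarded (a simplification).
    static Set<String> tokens(String code) {
        Set<String> result = new HashSet<>();
        for (String t : code.split("[^A-Za-z0-9_]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard similarity of the two token sets: |A ∩ B| / |A ∪ B|.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two empty snippets are treated as identical
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String posted = "buffer.reset(b1, s1, l1); key1.readFields(buffer);";
        String upstream = "buffer.reset(b1, s1, l1); key1.readFields(buffer); "
                + "buffer.reset(null, 0, 0);";
        // A high score suggests a clone candidate that deserves manual review.
        System.out.println(jaccard(posted, upstream));
    }
}
```

A set-based measure like this ignores token order and frequency, so it can only flag candidates. Separating code from the natural-language text of a post, which the paragraph above identifies as a difficulty, is not addressed by this sketch.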

Licensing violation

Code cloning can also affect software license compatibility. Carelessly cloning code from one project to another project with a different license may cause a software licensing violation (German et al., 2009). This also happens within the context of online Q&A websites such as Stack Overflow.


Examples of Toxic Code Snippets

1. Hadoop's compare method

The first example is an outdated and license-violating online code clone in an answer to a Stack Overflow question about how to implement RawComparator in Hadoop. The first snippet below is embedded as part of the accepted answer. It shows how Hadoop implements the compare method in its WritableComparator class. The second snippet shows another version of the same method, this time extracted from the latest version (as of October 3, 2017) of Hadoop.

The two are highly similar except for the line containing buffer.reset(null,0,0); which was added on November 21, 2014. The added line is intended to clean up the reference in the buffer variable and avoid excess heap usage (issue no. HADOOP-11323).

/* Code in Stack Overflow post ID 22315734 (no license) */
public int compare(byte[] b1,int s1,int l1,byte[] b2,int s2,int l2) {
  try {
    buffer.reset(b1, s1, l1); // parse key1
    key1.readFields(buffer);
    buffer.reset(b2, s2, l2); // parse key2
    key2.readFields(buffer);
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  return compare(key1, key2); // compare them
}
        
/* WritableComparator.java (2014-11-21) (Apache v.2.0 license) */
public int compare(byte[] b1,int s1,int l1,byte[] b2,int s2,int l2) {
  try {
    buffer.reset(b1, s1, l1);  // parse key1
    key1.readFields(buffer);
    buffer.reset(b2, s2, l2);  // parse key2
    key2.readFields(buffer);
    buffer.reset(null, 0, 0);  // clean up reference
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  return compare(key1, key2);  // compare them
}
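
The difference between the two versions above can also be located mechanically. The following is a minimal sketch, assuming a simple line-by-line comparison that ignores whitespace and trailing // comments; the class name MissingLines and its helper methods are hypothetical and are not part of Hadoop or of the study's tooling.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MissingLines {

    // Drop a trailing // comment and all whitespace so that formatting
    // differences do not count as changes. Limitation: a "//" inside a
    // string literal would be cut off as well.
    static String normalize(String line) {
        int comment = line.indexOf("//");
        String code = comment >= 0 ? line.substring(0, comment) : line;
        return code.replaceAll("\\s+", "");
    }

    // Return the upstream lines that have no counterpart in the posted snippet.
    static List<String> missingFrom(String posted, String upstream) {
        Set<String> seen = new HashSet<>();
        for (String line : posted.split("\n")) {
            seen.add(normalize(line));
        }
        List<String> missing = new ArrayList<>();
        for (String line : upstream.split("\n")) {
            String normalized = normalize(line);
            if (!normalized.isEmpty() && !seen.contains(normalized)) {
                missing.add(line.trim());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String posted = "buffer.reset(b1, s1, l1); // parse key1\n"
                + "key1.readFields(buffer);\n";
        String upstream = "buffer.reset(b1, s1, l1);  // parse key1\n"
                + "key1.readFields(buffer);\n"
                + "buffer.reset(null, 0, 0);  // clean up reference\n";
        // Reports only the clean-up line added upstream.
        System.out.println(missingFrom(posted, upstream));
    }
}
```

Because the comparison is set-based, it does not detect reordered or duplicated lines. It only reports upstream lines that are absent from the posted snippet.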
        

Although this change was introduced into the compare method several years ago, the code example in the Stack Overflow post remains unchanged. In addition, the original WritableComparator class in Hadoop is distributed under the Apache license version 2.0, while its cloned instance on Stack Overflow contains only the compare method and omits the license statement at the top of the file.

This raises two potential issues. First, the code snippet may appear to be under Stack Overflow's CC BY-SA 3.0 license instead of its original Apache license. Second, if the code snippet is copied and incorporated into another software project with a conflicting license, a legal issue may arise.

2. Hadoop's humanReadableInt method

The second motivating example is a toxic code snippet with more disruptive changes than the first. It can be found in an answer to a Stack Overflow question about how to format file sizes in a human-readable form. The first snippet below, which performs the task, comes from the StringUtils class in Hadoop.

The second snippet shows another version of the same method, this time extracted from the latest version of Hadoop. The two are completely different. The humanReadableInt method was rewritten on February 5, 2013 to fix a race condition (issue no. HADOOP-9252). As in the first example, the cloned code snippet on Stack Overflow does not include its original Apache v.2.0 license.

/* Code in Stack Overflow post ID 801987 (no license) */
public static String humanReadableInt(long number) {
  long absNumber = Math.abs(number);
  double result = number;
  String suffix = "";
  if (absNumber < 1024) {
    // nothing
  } else if (absNumber < 1024 * 1024) {
    result = number / 1024.0;
    suffix = "k";
  } else if (absNumber < 1024 * 1024 * 1024) {
    result = number / (1024.0 * 1024);
    suffix = "m";
  } else {
    result = number / (1024.0 * 1024 * 1024);
    suffix = "g";
  }
  return oneDecimal.format(result) + suffix;
}
            
/* StringUtils.java (2013-02-05) (Apache v.2.0 license) */
public static String humanReadableInt(long number) {
  return TraditionalBinaryPrefix.long2String(number,"",1);
}
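
The Stack Overflow snippet above is not self-contained: it relies on a oneDecimal field that the post does not show. The sketch below is a runnable variant under stated assumptions. It assumes that oneDecimal is a java.text.DecimalFormat with the pattern "0.0", and it creates the formatter inside the method instead of sharing it. The Java API documentation states that DecimalFormat instances are generally not synchronized, so a shared formatter is one plausible source of a race condition, although the text above does not confirm that this is what HADOOP-9252 describes. The class name HumanReadable is hypothetical.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class HumanReadable {

    public static String humanReadableInt(long number) {
        // Assumption: stands in for the snippet's unseen oneDecimal field.
        // A per-call instance avoids sharing mutable formatter state across
        // threads; Locale.ROOT fixes the decimal separator to ".".
        DecimalFormat oneDecimal =
                new DecimalFormat("0.0", DecimalFormatSymbols.getInstance(Locale.ROOT));

        long absNumber = Math.abs(number);
        double result = number;
        String suffix = "";
        if (absNumber < 1024) {
            // small values are printed without a suffix
        } else if (absNumber < 1024 * 1024) {
            result = number / 1024.0;
            suffix = "k";
        } else if (absNumber < 1024 * 1024 * 1024) {
            result = number / (1024.0 * 1024);
            suffix = "m";
        } else {
            result = number / (1024.0 * 1024 * 1024);
            suffix = "g";
        }
        return oneDecimal.format(result) + suffix;
    }

    public static void main(String[] args) {
        System.out.println(humanReadableInt(1536)); // prints 1.5k
    }
}
```

Creating a formatter on every call trades a small allocation cost for thread safety. It is shown here only to make the old snippet runnable, and it is not the approach taken by the rewritten Hadoop method, which delegates to TraditionalBinaryPrefix.long2String.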