26 December 2015

Codeforces Submissions: Dataset for Source Code Analysis

I wanted to do some analysis on source code, and for that I needed a dataset where code snippets are labeled with the programming language they are written in. I scraped this data from codeforces.com, which is a website for holding programming contests. In this post, I share this data.

tl;dr Scroll down to get the links.

Codeforces

On codeforces.com participants are given programming problems, and they need to solve them using any language the platform supports – C/C++, Java, Python and many others. Codeforces.com is a very good website for those who want to learn algorithms and programming: not only are the problems fun to solve, but you can also see how they were solved by others – that is, you can have a look at the source code of each submission to the platform.

It is also interesting for data analysis: along with the source code, each submission has some metadata, including the language used and the verdict (accepted/not accepted/compilation error). So it seemed like the best place to get the dataset I needed.

Scraping

I didn’t find direct links to get the source code for each submission. Codeforces has some API, but for some reason I wasn’t able to use it to get the source code.

The only way to get the code was through the website: when you click on a submission, the code is shown in a pop-up. Thus, to get the data I needed to emulate a web browser. For that I wrote a small script in Java using Selenium, and you can see it here: CodeExtractor.java. Selenium is quite slow, so I used multi-threading to scrape the data in parallel. The data was then put into a MySQL database.
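
For reference, the same idea sketched in Python with Selenium would look roughly like this (the real scraper is the Java class linked above; the URL and the CSS selectors below are just placeholders and may not match the actual page structure):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://codeforces.com/problemset/status")  # page listing recent submissions

# Placeholder selectors; the real page markup may differ.
for row in driver.find_elements(By.CSS_SELECTOR, "tr[data-submission-id]"):
    submission_id = row.get_attribute("data-submission-id")
    # Clicking a submission opens a pop-up with its source code.
    row.find_element(By.CSS_SELECTOR, "a.view-source").click()
    source = driver.find_element(By.CSS_SELECTOR, "pre.program-source").text
    # ... store submission_id and source (e.g. insert them into MySQL) ...
    # A real scraper would also need explicit waits, pop-up closing,
    # pagination and error handling.

driver.quit()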

I extracted the following data:

  • submission_id - the ID of the individual submission
  • source - the source code of the submission
  • status - the verdict of the submission (accepted, not accepted, compilation error, etc.)
  • language - the programming language which was used to make the submission
  • problem - the ID of the problem

You can download it here:

This is a MySQL dump, so you need MySQL to import this dataset.
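
Once the dump is imported, the submissions can be pulled into Python with something like this (a minimal sketch; the database and table names here are assumptions – check what the dump actually creates):

import pymysql  # or any other MySQL client

conn = pymysql.connect(host="localhost", user="root", password="",
                       database="codeforces")  # assumed database name
with conn.cursor() as cur:
    cur.execute("SELECT submission_id, source, status, language, problem "
                "FROM submissions")  # assumed table name
    for submission_id, source, status, language, problem in cur:
        pass  # e.g. keep only accepted submissions, tokenize the source, ...
conn.close()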

Tokenization

The raw source code is hard to work with, and to do something meaningful we first need to make it machine-processable. The easiest way to do it is to tokenize it – i.e. split the source code into tokens. For example, consider the following snippet in Python:

def greet(name):
    print 'Hello', name
greet('Jack')

It can be represented as the following list of tokens: ["def", "greet", "(", "name", ")", ":", "print", "'Hello'", ",", "name", "greet", "(", "'Jack'", ")"].

This representation is much easier to work with when analyzing the dataset.

For tokenization I used StreamTokenizer from Java. It is pretty generic and can handle other programming languages, not just Java or syntactically similar languages (e.g. C/C++): it does a good job tokenizing Pascal, Python, Ruby, Haskell and others. You can see the code of the tokenizer here: BowFeatureExtractor.java.
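
If you prefer Python, a rough equivalent of this kind of generic tokenization can be done with a regular expression (just a sketch to illustrate the idea – this is not the tokenizer that produced the pre-tokenized dataset below):

import re

# Identifiers/keywords, string literals, numbers, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|'[^']*'|\"[^\"]*\"|\d+|\S")

def tokenize(source):
    return TOKEN_RE.findall(source)

print(tokenize("def greet(name):\n    print 'Hello', name\ngreet('Jack')"))
# ['def', 'greet', '(', 'name', ')', ':', 'print', "'Hello'", ',', 'name', 'greet', '(', "'Jack'", ')']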

I already pre-tokenized the dataset, and you can download it here:

(It may say that the checksum does not match for the archive – just ignore it; for some reason Java doesn’t gzip very well.)

You don’t need MySQL to use this version of the dataset.

Some statistics

There are about 270k submissions in the dataset, and C/C++ is the most popular choice: more than 90% of all submissions are C/C++.

Note that this is a very tiny subset of all codeforces submissions!
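
The distribution is easy to recompute from the MySQL version of the dataset, for example like this (a sketch, reusing the assumed database and table names from the snippet above):

from collections import Counter

import pymysql

conn = pymysql.connect(host="localhost", user="root", password="",
                       database="codeforces")  # assumed names, as above
with conn.cursor() as cur:
    cur.execute("SELECT language FROM submissions")
    langs = Counter(language for (language,) in cur)
conn.close()

total = sum(langs.values())
for lang, n in langs.most_common():
    print("%-20s %8d  %5.1f%%" % (lang, n, 100.0 * n / total))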

Example

To see how you can use this dataset, I prepared a Jupyter notebook, where I build a simple Decision Tree model to determine the language of a submission.

You can see it here: codeforces-langdetect.ipynb.

First, I classify C/C++ vs not C/C++, and then I build a model to classify the rest. I don’t use all the tokens from the source code, only the keywords. The list of keywords is taken from Notepad++ – they are kept in a file called langs.model.xml.
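
The general shape of the approach can be sketched with scikit-learn roughly like this (this is not the exact notebook code; the keyword list and the two sample documents are tiny stand-ins for the real data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Stand-in keyword list; the notebook takes the real one from Notepad++'s langs.model.xml.
keywords = ["std", "import", "#include", "printf", "String", "scan", "def"]

# Stand-in data: space-joined token lists and C/C++ vs OTHER labels;
# in practice these come from the tokenized dataset.
token_docs = ["#include <iostream> using namespace std ; int main ( ) { }",
              "import java.util.* ; public class Main { }"]
labels = ["C/C++", "OTHER"]

# Binary bag-of-words features restricted to the keyword vocabulary.
vectorizer = CountVectorizer(vocabulary=keywords, lowercase=False,
                             token_pattern=r"\S+", binary=True)
X = vectorizer.fit_transform(token_docs)

tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, labels)
print(tree.predict(vectorizer.transform(["printf ( x ) ;"])))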

Here’s the model I get for C/C++ vs the rest:

if contains('std'):
  if contains('import'):
    if contains('scan'):
      return 'C/C++' # (1.000, 3/3 examples)
    else: # doesn't contain 'scan'
      return 'OTHER' # (1.000, 47/47 examples)
  else: # doesn't contain 'import'
    if contains('String'):
      return 'C/C++' # (0.800, 4/5 examples)
    else: # doesn't contain 'String'
      return 'C/C++' # (0.999, 230230/230376 examples)
else: # doesn't contain 'std'
  if contains('printf'):
    if contains('String'):
      return 'OTHER' # (0.992, 604/609 examples)
    else: # doesn't contain 'String'
      return 'C/C++' # (0.993, 12940/13029 examples)
  else: # doesn't contain 'printf'
    if contains('#include'):
      return 'C/C++' # (1.000, 193/193 examples)
    else: # doesn't contain '#include'
      return 'OTHER' # (0.977, 28517/29181 examples)

You can see the other model for the rest of the languages in the notebook.

I hope you’ll find this dataset useful. If you do, and you run some interesting analysis, please let us know the results.

Stay tuned!
