# Codeforces Submissions Dataset

I wanted to do some analysis on source code, and I needed a dataset where code snippets are labeled with the programming language they are written in. I scraped this data from codeforces.com, a website that hosts programming contests. In this post, I share this data.

tl;dr Scroll down to get the links.

## Codeforces

On codeforces.com participants are given programming problems, which they need to solve using any language the platform supports, including C/C++, Java, Python and others. Codeforces.com is a very good website for those who want to learn algorithms and programming: not only are the problems fun to solve, but you can also see how others solved them – that is, you can have a look at the source code of each submission to the platform.

It is also interesting for data analysis: along with the source code, each submission has some metadata, including the language used and the verdict (accepted, not accepted, compilation error). So it seemed like the best choice for getting the dataset I needed.

## Scraping

I didn’t find direct links to the source code of each submission. Codeforces has an API, but for some reason I wasn’t able to use it to get the source code.

The only way to get the code was through the website: after you click on a submission, it shows the code in a pop-up. Thus, to get the data I needed to emulate a web browser. For that I wrote a small script in Java using Selenium, and you can see it here: CodeExtractor.java. Selenium is quite slow, so I used multi-threading to scrape data in parallel. The data was then put into a MySQL database.

I extracted the following data:

• submission_id - the ID of the individual submission
• source - the source code of the submission
• status - the verdict of the submission (accepted, not accepted, compilation error, etc)
• language - the programming language which was used to make the submission
• problem - the ID of the problem

This is a MySQL dump, so you need MySQL to import this dataset.
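To illustrate the shape of the data, here is a hypothetical schema matching the fields listed above, sketched with sqlite3 for portability (the actual dump is MySQL, and its real column types may differ):

```python
import sqlite3

# Assumed schema based on the listed fields; the real dump is MySQL
# and the exact column types there may be different.
SCHEMA = """
CREATE TABLE submissions (
    submission_id INTEGER PRIMARY KEY,
    source        TEXT,
    status        TEXT,
    language      TEXT,
    problem       TEXT
)
"""

conn = sqlite3.connect(':memory:')
conn.execute(SCHEMA)
# A made-up row, just to show the structure.
conn.execute("INSERT INTO submissions VALUES (1, 'int main(){}', 'OK', 'GNU C++', '550A')")
rows = conn.execute('SELECT language FROM submissions').fetchall()
```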

## Tokenization

The raw source code is hard to work with, so to do something meaningful we first need to make it machine-processable. The easiest way to do that is to tokenize it – i.e. split the source code into tokens. For example, consider the following Python snippet:

```python
def greet(name):
    print 'Hello', name

greet('Jack')
```


It can be represented as this list of tokens ["def", "greet", "(", "name", ")", ":", "print", "'Hello'", ",", "name", "greet", "(", "'Jack'", ")"].

This representation is much easier to work with when analyzing the dataset.
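As a rough illustration of what such a tokenizer does (this is not the author's BowFeatureExtractor, just a minimal regex-based sketch), something like this produces the token list above:

```python
import re

# Minimal tokenizer sketch: single-quoted or double-quoted strings,
# words (optionally prefixed with '#', so '#include' stays one token),
# or any single non-space character as a fallback.
TOKEN_RE = re.compile(r"'[^']*'|\"[^\"]*\"|#?\w+|\S")

def tokenize(source):
    return TOKEN_RE.findall(source)
```

For the `greet` snippet above, `tokenize` returns exactly the token list shown.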

For tokenization I used StreamTokenizer from Java. It is pretty generic and can handle other programming languages, not just Java or syntactically similar languages (e.g. C/C++): it does a good job tokenizing Pascal, Python, Ruby, Haskell and others. You can see the code of the tokenizer here: BowFeatureExtractor.java.

(It may say that the checksum does not match; just ignore it – for some reason Java doesn’t gzip very well.)

You don’t need MySQL to use this version of the dataset.

## Some statistics

There are about 270k submissions in the dataset, and C/C++ is by far the most popular choice: more than 90% of all submissions are in C/C++.

Note that this is a very tiny subset of all codeforces submissions!

## Example

To see how you can use this dataset, I prepared a Jupyter notebook, where I build a simple Decision Tree model to determine the language of a submission.

You can see it here: codeforces-langdetect.ipynb.

First, I classify C/C++ vs not C/C++, and then I build a model to classify the rest. I don’t use all the tokens from the source code, only the keywords. The list of keywords is taken from Notepad++ – they are kept in a file called langs.model.xml.
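The feature extraction step can be sketched like this: each submission becomes a binary vector marking which keywords occur among its tokens. The keyword list here is a small illustrative subset I made up; the actual list in the notebook comes from Notepad++’s langs.model.xml.

```python
# Illustrative subset of language keywords (hypothetical); the real
# list is taken from Notepad++'s langs.model.xml.
KEYWORDS = ['#include', 'std', 'printf', 'import', 'String', 'def', 'begin', 'end']

def keyword_features(tokens):
    """Binary bag-of-keywords vector: 1 if the keyword occurs among the tokens."""
    present = set(tokens)
    return [1 if kw in present else 0 for kw in KEYWORDS]
```

These vectors are what the decision tree is trained on.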

Here’s the model I get for C/C++ vs the rest:

```
if contains('std'):
    if contains('import'):
        if contains('scan'):
            return 'C/C++'  # (1.000, 3/3 examples)
        else:  # doesn't contain 'scan'
            return 'OTHER'  # (1.000, 47/47 examples)
    else:  # doesn't contain 'import'
        if contains('String'):
            return 'C/C++'  # (0.800, 4/5 examples)
        else:  # doesn't contain 'String'
            return 'C/C++'  # (0.999, 230230/230376 examples)
else:  # doesn't contain 'std'
    if contains('printf'):
        if contains('String'):
            return 'OTHER'  # (0.992, 604/609 examples)
        else:  # doesn't contain 'String'
            return 'C/C++'  # (0.993, 12940/13029 examples)
    else:  # doesn't contain 'printf'
        if contains('#include'):
            return 'C/C++'  # (1.000, 193/193 examples)
        else:  # doesn't contain '#include'
            return 'OTHER'  # (0.977, 28517/29181 examples)
```
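The tree above is small enough to transcribe by hand into a working classifier. This sketch assumes `contains()` means "the token occurs in the submission's token list":

```python
def predict(tokens):
    """The C/C++ vs OTHER decision tree, transcribed by hand.

    Assumes contains(x) in the printed tree tests token membership.
    """
    has = set(tokens).__contains__
    if has('std'):
        if has('import'):
            return 'C/C++' if has('scan') else 'OTHER'
        # with 'std' and no 'import', both 'String' sub-branches predict C/C++
        return 'C/C++'
    if has('printf'):
        return 'OTHER' if has('String') else 'C/C++'
    return 'C/C++' if has('#include') else 'OTHER'
```

For example, a token list with `std` and no `import` comes out as C/C++, while one with `import` and `String` but no C/C++ markers comes out as OTHER.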


You can see the other model for the rest of the languages in the notebook.

I hope you’ll find this dataset useful. If you do, and you run some interesting analysis, please let me know about the results.

Stay tuned!