26 December 2015

Codeforces Submissions: Dataset for Source Code Analysis

Codeforces Submissions Dataset

I wanted to do some analysis on source code, and I needed a dataset where code snippets are labeled with the programming language they are in. I scraped this data from codeforces.com, which is a website for holding programming contests. In this post, I share this data.

tl;dr Scroll down to get the links.

Business Intelligence in the Non-Profit Sector

Correct, relevant, concise and up-to-date information, in sufficient quantity, is a key input to any decision-making process. This applies not only to profit-driven organisations but also to the non-profit sector.

For instance, in a non-profit organisation, efficient access to good-quality membership information is of utmost importance when defining membership strategies. Good information is also crucial when translating strategies into tactics and, subsequently, turning those tactics into action at the operational level.

17 December 2015

Test-Driven Machine Learning

The book “Test-Driven Machine Learning” by Justin Bozonier, published by Packt Publishing, is in print now. I was a technical reviewer of this book, and in this post you will learn some details about it. The book is available on the publisher’s website as well as on Safari Books Library.

19 October 2015

Data Science Interview Questions

Source: Data Science: An Introduction

Our IT4BI Master studies finished, and the next logical step after graduation is finding a job. I was interested in Data Science jobs and this post is a summary of my interview experience and preparation.

The term “Data Science” is not yet well established, so interviews for Data Science jobs might include a very broad range of questions, depending on how a particular company interprets the term. In this post I attempt to organize Data Science interview questions in some usable form, though the result may be biased by how I see Data Science myself. I hope you find it useful.

18 October 2015

Java Interview Questions

In the past, I was a Software Developer, and my primary programming language was Java. I often interviewed people and was sometimes an interviewee myself. In this post, I would like to share typical questions that you might expect at a job interview for a Java Developer position.

15 October 2015

Mastering Data Analysis with R

The book "Mastering Data Analysis with R" by Gergely Daróczi, published by Packt Publishing, is in print now, and I had the pleasure of being a technical reviewer of this book.

If you're a Data Scientist who's looking to master R, this book is a good choice. It's already available on the publisher's website and on Safari books online.

11 June 2015

Recognition Award for IT4BI

The French Institute for Research in Computer Science and Automation (INRIA) has awarded an ongoing IT4BI Master Thesis Project with the prize "Prix spécial du jury" in the context of the competition "Boost Your Code 2015".

The award was given for ElectioVis, an open source decision-aiding software tool that I started to develop as part of the IT4BI Decision Support and BI specialisation I am pursuing at École Centrale Paris. The project benefits from the academic guidance of Prof. Valentina Ferretti, an expert in the decision-making field who lectures at École Centrale Paris and Polytechnic University of Turin.

ElectioVis is a website that aims to bring the power of decision-making closer to all citizens of the world, overcoming economic, social, cognitive and language barriers. It will be available online during June 2015 and everyone will be able to try it for free.

6 June 2015

The Four Fundamental Subspaces

This is the first blog post in the series “Fundamental Theorem of Linear Algebra”, where we work through Gilbert Strang’s paper “The fundamental theorem of linear algebra”, published in the American Mathematical Monthly in 1993.

In this post, we will go through the first two parts of the Fundamental Theorem: the dimensionality and the orthogonality of the Fundamental Subspaces.
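
As a quick reminder of what these two parts say, here is a compact statement for an m × n matrix A of rank r. The notation is our own choice for this summary and may differ from the paper: C(A) is the column space, C(A^T) the row space, N(A) the nullspace and N(A^T) the left nullspace.

```latex
\begin{align*}
\dim C(A) &= r, & \dim N(A) &= n - r, \\
\dim C(A^{T}) &= r, & \dim N(A^{T}) &= m - r, \\
N(A) &\perp C(A^{T}) \ \text{in } \mathbb{R}^{n}, & N(A^{T}) &\perp C(A) \ \text{in } \mathbb{R}^{m}.
\end{align*}
```

The dimensions in each column add up to the dimension of the ambient space: r + (n - r) = n for the row space and nullspace, and r + (m - r) = m for the column space and left nullspace.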

Strang’s original diagram from the paper.

The Fundamental Theorem of Linear Algebra by G. Strang

The Fundamental Theorem of Linear Algebra

This is a series of articles devoted to Gilbert Strang’s paper “The fundamental theorem of linear algebra”, published in the American Mathematical Monthly in 1993.

29 April 2015

How to count distinct?

Counting the number of distinct elements in a data set is a very common query. It can give you an idea of how many duplicates you are dealing with. Say, for example, that you have a set of transactions and you wish to find out whether they come from a small set of frequent customers or from many different customers. This can help you understand your clients and decide what type of marketing strategy to adopt.
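
To make the transactions example concrete, here is a minimal sketch of an exact distinct count in plain Python. The transaction records and the `count_distinct` helper are invented for illustration; this is the simplest in-memory approach, not necessarily the technique the full post discusses.

```python
from collections import Counter

# Hypothetical (customer_id, amount) transactions, invented for illustration.
transactions = [
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 7.0),
    ("carol", 99.9),
    ("alice", 15.0),
]


def count_distinct(values):
    """Exact distinct count: a set keeps a single copy of each value."""
    return len(set(values))


customer_ids = [customer for customer, _ in transactions]
distinct_customers = count_distinct(customer_ids)

# How many transactions each customer made, to spot frequent buyers.
purchases_per_customer = Counter(customer_ids)

print(distinct_customers, "distinct customers in", len(transactions), "transactions")
print("most frequent buyer:", purchases_per_customer.most_common(1))
```

The set holds every distinct value in memory, so this only suits data that fits on one machine.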

23 April 2015

Apache Mahout Samsara: The Quick Start

Last week the newest Apache Mahout 0.10 was released. One of the new features it has is a new math environment called “Samsara”, or Mahout Scala/Spark Bindings.

Samsara is a Linear Algebra library for Mahout. It’s written in Scala, which makes operator overloading possible, and it features a nice R-like or Matlab-like syntax for basic Linear Algebra operations. For example, matrix multiplication is just X %*% Y. What is more, these operations can be distributed and run by an execution environment, currently Apache Spark.

In this article we will see how to quickly set up a basic skeleton project and then we’ll try to do some very simple analysis on a 200 MB dataset.

9 March 2015

Naive Bayes on Apache Flink

In this blog post we are going to implement a Naive Bayes classifier in Apache Flink. We are going to use it for text classification by applying it to the 20 Newsgroups dataset. To understand what is going on, you should be familiar with Java and know what MapReduce is. If you have seen and understood a word count example in any system, you're good to go. If you haven't heard of MapReduce or haven't seen the word count, you may first have a look at our introductory post "Hadoop and MapReduce".

4 March 2015

Hadoop and MapReduce

In this article we will briefly discuss the MapReduce computation paradigm, and Apache Hadoop as one of its implementations. We won't go into much detail, and we won't even implement the Word Count on Hadoop, but it should give some foundation for future articles about tools for scalable data processing.
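
For readers who have never seen the word count, here is a rough single-process sketch of the map, shuffle and reduce steps in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the input lines are our own, chosen only to mirror the phases of the paradigm.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Toy input, invented for illustration.
lines = ["to be or not to be", "to do is to be"]
counts = word_count(lines)
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; here everything happens in one loop, which is enough to see the data flow.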

3 March 2015

The Dark Side of Entrepreneurship

"We will have more than a million clients and our company will be top leader in the industry over the next year". This is what every first time entrepreneur says at some point in time.

We often hear stories about young entrepreneurs who dropped out of school at a very young age and had huge success. We look at these very few success stories and, as entrepreneurs, we lie to ourselves that one day we will be like them...

You normally recognise entrepreneurs as those who change jobs very frequently. They try a bit of everything and, in the end, they don’t get deep into any of the topics. They like to taste a bit of everything. They change countries, jobs and friends and it seems that, everywhere they land, they find something to do. They are proactive and extremely curious. They just don’t find their place in any of the traditional companies. They are dreamers and born sellers, even if they have to sell things not even they can imagine.

15 February 2015

Spring Batch Essentials

The book "Spring Batch Essentials" by Packt Publishing is in print now, and I had the pleasure of being a technical reviewer of this book.

Spring Batch is a tool for creating ETL ("Read/Process/Write") jobs: batch processing of large portions of enterprise data that require sophisticated transformations and involve complex business logic. It lets you manage jobs easily, supports transactions and allows job execution to be scaled to process large volumes of data.

4 February 2015

Best Time to Learn Linear Algebra is Now!

Linear Algebra is a crucial prerequisite for many things, including Statistics, Data Mining, Machine Learning, Computer Vision, Image Processing and many many others, so it's very important to know the basics of Linear Algebra to understand more advanced concepts. For example, it's really helpful for our IT4BI studies, especially for the specialization at TU Berlin.

And the best time to learn Linear Algebra or refresh your knowledge of it is right now! At the moment there are a couple of nice MOOCs that have just started, and a few more are about to start in the near future.

Even if you don't join right now, they should be available later as self-paced versions. Additionally, I would like to include my favorite video courses on Linear Algebra. They, too, can be followed at your own speed with no deadlines.

29 January 2015

What to think about when coming up with a BI startup idea

Recently, I did the exercise of creating a business plan, following the single requirement that the business should be focused on Business Intelligence (BI). Almost all businesses nowadays are data driven, i.e. they incorporate some form of BI to drive the company decisions. However, the restriction for us was that our businesses had to deliver information, not only use it internally.

Now, even though I have been working in the area of BI for the past two years, coming up with a business idea from scratch was not an easy task. In this post I will share with you some of the learning outcomes of this experience, so that if you desire to start up your own BI focused business, you can benefit from what we learned.

28 January 2015


I quite often receive questions about the IT4BI program: my CV is on the program's website, and people see my email there and write to me. I also get quite a lot of questions through social networks. I usually try my best to answer all of them, but many questions come up again and again, so it's a good idea to create a FAQ and make the answers available to everybody, so that I don't have to repeat myself. A fairly large part of the questions are about the admission process, and I already addressed some of them in IT4BI: How To.

24 January 2015

IT4BI: How To

In the post European Master Programs in Data Analysis I listed some Master programs that I myself applied to. There I also briefly outlined how to apply to them. In this post, I would like to give more details about the process of applying to the program of my choice: Erasmus Mundus IT4BI. This is mostly based on my experience and I would like to share this with you.

Additionally, I sometimes get questions by email about the program, and quite a lot of them are about the process: documents, motivation letters, etc. In this post I will also address these questions, and spare myself the trouble of typing the same text over and over again :)

15 January 2015

European Master Programs in Data Analysis

Some time ago, in 2012, I decided that I wanted to continue my education and get a Master's Degree. First, I wanted to get more involved in data analysis and related things like machine learning or data mining. Second, it had to be a program in Europe because I didn't want to travel too far: back then I lived in Poland. And lastly, I wanted to find a program with both tuition waivers and monthly allowances to cover living expenses.

I started actively looking for programs that met these criteria and now I would like to share my experience. In this blog post I list the programs that were most interesting to me, and I include only the ones that I applied to myself. I will not describe the programs and scholarships in detail, but will instead link to pages with more information. However, I will add some things that I think are important, e.g. the application process and other interesting details.

7 January 2015

IT4BI: Distributed and Large-Scale Business Intelligence

The first semester of the second year is devoted to specialization, and there are 3 possible choices in the IT4BI program. One of them is "Distributed and Large-Scale Business Intelligence" and it is delivered by the DIMA group at the Technical University of Berlin. It is the specialization of my choice, so I'm happy to share my experience about it in this post.