IT Shared: IT4BI: Distributed and Large-Scale Business Intelligence

The first semester of the second year is devoted to specialization, and there are 3 possible choices in the IT4BI program. One of them is "Distributed and Large-Scale Business Intelligence", delivered by the DIMA group at the Technical University of Berlin. It is the specialization I chose, so I'm happy to share my experience with it in this post.

This is the second post about the IT4BI curriculum. If you haven't read the first one yet, go here: IT4BI First Year: Business Intelligence Fundamentals.

I'll present the courses in the same way as in the previous post: first, I'll copy the description from the IT4BI website, then I'll list things we study, and then add a line or two about projects and exams.

Disclaimer: I am part of the 2nd generation of IT4BI, so this post applies to the academic year 2014/2015 and might not be accurate for later years.

TUB, 3rd semester

So the following classes are offered at TU Berlin:

Big Data Analytics Seminar
Implementation of a Database Engine
Heterogeneous and Distributed Information Systems
Large Scale Data Analysis and Data Mining
Big Data Analytics Projects
Humanities: Interdisciplinary Communication
Machine Learning (Optional)

At TU Berlin, IT4BI students choose between Heterogeneous and Distributed Information Systems and Large Scale Data Analysis and Data Mining. Additionally, it's possible to attend the Machine Learning classes, but without credit.

Big Data Analytics Seminar

Description: Participants of this seminar will acquire knowledge about recent research results and trends in the analysis of web-scale data. Through the work in this seminar, students will learn the comprehensive preparation and presentation of a research topic in this field. In order to achieve this, students will get to read and categorize a scientific paper, conduct background literature research and present as well as discuss their findings.

Each of us is assigned a paper and a mentor. We have to work through the paper, analyze its strengths and weaknesses, and then deliver two presentations about it as well as a written report.

Some of the papers from 2014:

Summingbird: A Framework for Integrating Batch and Online MapReduce Computations (pdf)
SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets (pdf)
SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures (pdf)
Fast Personalized PageRank on MapReduce (pdf)
Full list

Some of the papers from 2013:

Upper and lower bounds on the cost of a map-reduce computation (pdf)
Making queries tractable on big data with preprocessing (pdf)
Jet: An Embedded DSL for High Performance Big Data Processing (pdf)
Only Aggressive Elephants are Fast Elephants (pdf)
Full list

Note that the papers were quite different in 2013 and 2014, so there's a high chance that the list will change in the coming years. Still, it should give you a rough idea of what kind of papers are studied in this course.

There is no exam for this course and no separate course project, just a presentation about the paper at the end of the course.

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/imsem/ (in German)

Implementation of a Database Engine

Description: In this lab course students will learn how to implement components of a query processor with focus on complex queries as they occur in data warehouses and OLAP. Students will create a working SQL query processor that can answer a set of basic analytical queries.

This course is about implementing the things we learned in our Database System Architecture classes at ULB in our first semester (you can read more about it here).

The things we have to implement:

Basic IO
ARC Cache
Buffer Pool manager
Indexing, B-Tree
Operators (Table scan, Index scan; Filter, Nested loop join, Merge sort join, Group by)
Optimizers (Join order, access path selector)
Map-Reduce

Almost every week we get a new assignment, implement it in Java and submit it to an automated test system that awards up to 10 points. We don't have to implement everything from scratch: we are given some backbone code, so we only need to implement certain interfaces and call certain API methods. Still, it is quite challenging and time-consuming.
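To give a feeling for what such operators look like, here is a minimal sketch of a table scan, a filter and a nested loop join written as chained iterators. It is plain Python rather than the Java skeleton used in the course, and every name in it (table_scan, filter_op, nested_loop_join, the sample tables) is made up for illustration.

```python
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple  # in this sketch a row is just a plain tuple


def table_scan(table: Iterable[Row]) -> Iterator[Row]:
    """Yield every row of an in-memory table, one at a time."""
    for row in table:
        yield row


def filter_op(child: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pass through only the rows for which the predicate holds."""
    for row in child:
        if predicate(row):
            yield row


def nested_loop_join(outer: Iterable[Row], inner_table: Iterable[Row],
                     outer_key: int, inner_key: int) -> Iterator[Row]:
    """For each outer row, rescan the inner table and emit matching pairs."""
    for o in outer:
        for i in table_scan(inner_table):
            if o[outer_key] == i[inner_key]:
                yield o + i  # concatenate the two tuples into one result row


# Hypothetical sample data: (student_id, name) and (student_id, course, grade)
students = [(1, "Ana"), (2, "Ben"), (3, "Cem")]
grades = [(1, "DB", 9), (3, "DB", 7), (3, "ML", 8)]

# Plan: scan students, drop student 2, join with grades on student_id
plan = nested_loop_join(
    filter_op(table_scan(students), lambda r: r[0] != 2),
    grades, outer_key=0, inner_key=0,
)
result = list(plan)
```

Each operator only pulls rows from its child when asked for the next one, so operators can be stacked into a query plan without materializing intermediate results. A real engine would read pages through the buffer pool instead of Python lists.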

There's no exam for this course, and the entire course is a course project.

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/idb-pra/

Heterogeneous and Distributed Information Systems

Description: In this course students will gain conceptual, methodological and practical knowledge about the development and integration of modern distributed, heterogeneous information systems based on the concepts of model integration, data integration, promotion of information systems and metadata management for Business Intelligence.

Note: I selected the Large Scale Data Analysis course instead of this, so the information might not be 100% accurate.

The topics covered in the class:

Foundations/Terminology of HDIS (FDBS, FIS, MBIS)
Dimensions of HDIS: Distribution, Heterogeneity, Autonomy
Heterogeneous Data Models in HDIS: structured, semistructured, unstructured
Distributed Data Organisation and Software Architectures of HDIS (FIS, P2P, CS, ...)
Interoperability and Middleware Platforms for HDIS
Persistency Services
Metadata Standards and Management in HDIS
Model-based Development of HDIS
Applications from Industry and Public Services

The class is taught seminar-style: students are assigned different topics, which they study on their own and then present to the others. The topics that nobody chooses are presented by the teacher. There is a course project, and this semester it was about designing a web-based system for choosing a movie while also considering the best way to reach the cinema and suggesting a good restaurant nearby to go to afterwards. There is no exam at the end.

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/aim-1/ (in German)

Large Scale Data Analysis and Data Mining

Description: The focus of this module is to get familiar with different parallel processing platforms and paradigms and to understand their feasibility for different kinds of data mining problems. For that, students will learn how to adapt popular data mining and standard machine learning algorithms such as Naive Bayes, K-Means clustering, PageRank, Alternating Least Squares, or other methods of text mining, graph analysis, or recommender systems to scalable processing paradigms. Students will subsequently gain practical experience in how to analyse Big Data on parallel processing platforms such as the Apache Big Data Stack (Hadoop, Giraph, etc.), the Berkeley Big Data Analytics Stack (e.g., Spark) and Apache Flink.

Topics covered:

Hadoop MapReduce
Joins in MapReduce
Apache Spark, Apache Flink
Math refresher: Linear Algebra, some Probability and Statistics
Classification: Naive Bayes, Ensembles, Logistic Regression, SVM
Stochastic Gradient Descent
Clustering: K-Means in MapReduce, BFR, CURE
Recommender Systems
Dimensionality Reduction, SVD
Graph Mining, PageRank, Pregel
Large Scale Statistical Natural Language Processing
Online Learning / Stream Processing
Privacy and Legal Issues
Visualization Analytics

All topics are discussed in the context of processing data on a large scale. This course is not a replacement for the Data Mining course that we had at UFRT during our second semester, and it's different from the Machine Learning classes at TUB, because it deals with scalable algorithms that aren't covered in other IT4BI courses.

We had 4 assignments during the course:

Hadoop assignment
Apache Flink assignment
Math refresher: implementing matrix multiplication in Flink
Implementing Naive Bayes in Flink for text classification
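
As an illustration of the idea behind the matrix multiplication assignment, here is a rough sketch of the map/group/reduce formulation for sparse matrices. It is plain Python, not Flink code and not the actual assignment solution; the function names and the (row, column, value) input format are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(a_entries, b_entries):
    """Key every non-zero entry by the shared index k (column of A, row of B)."""
    for i, k, v in a_entries:
        yield k, ("A", i, v)
    for k, j, v in b_entries:
        yield k, ("B", j, v)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_products(groups):
    """For each k, multiply every A entry with every B entry sharing that k."""
    for values in groups.values():
        a_side = [(i, v) for tag, i, v in values if tag == "A"]
        b_side = [(j, v) for tag, j, v in values if tag == "B"]
        for i, av in a_side:
            for j, bv in b_side:
                yield (i, j), av * bv


def reduce_sum(pairs):
    """Second reduce step: sum the partial products for each output cell."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def multiply(a_entries, b_entries):
    products = reduce_products(shuffle(map_phase(a_entries, b_entries)))
    return reduce_sum(products)


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], zeros omitted
a = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
b = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]
c = multiply(a, b)
```

The first step is essentially a join of the two matrices on k, and the second is a group-by on the output cell (i, j), which is why the same pattern maps naturally onto join and group-reduce operators in a dataflow system.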

We also have a project on any related topic of our choice that we have to defend by presenting a poster about it, and there's also an oral exam at the end. The exam was about the project and about some topics we studied.

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/aim-3/ (in German)
GitHub: https://github.com/sscdotopen/aim3

Big Data Analytics Projects

Description: In this course students will learn to systematically analyze a current issue in the information management area and to develop and implement a problem-oriented solution as part of a team.

During this course we had to choose a paper on graph processing and implement it in groups in Apache Flink, using the Scala programming language.

We could select papers on the following topics:

Structural diversity measured through number of (strongly or weakly) connected components
User engagement measured through k-core decomposition
Importance of a node measured through centrality
Spatio-Temporal Dynamics of Online Memes
Social Contagion
Minimum Spanning Forest and Single Source Shortest Path
Graph Coloring
Approximate Maximum Weight Matching
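
To illustrate the first topic, here is a small sketch of finding weakly connected components by iteratively propagating the minimum label between neighbours, which is the usual vertex-centric way to express it. It is plain Python rather than the Flink/Scala code we wrote, and the function name and the toy graph are assumptions for the example.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    # Every vertex starts with its own id as its component label.
    labels = {v: v for v in vertices}

    # Treat edges as undirected, so this finds weakly connected components.
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    changed = True
    while changed:  # one loop iteration corresponds to one superstep
        changed = False
        new_labels = dict(labels)
        for v in vertices:
            # Take the minimum label among the vertex and its neighbours,
            # using only labels from the previous superstep.
            candidate = min([labels[v]] + [labels[n] for n in neighbors[v]])
            if candidate < new_labels[v]:
                new_labels[v] = candidate
                changed = True
        labels = new_labels
    return labels


labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
num_components = len(set(labels.values()))
```

The number of distinct labels at the end is the number of components, which is the quantity the structural diversity measure is based on.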

We meet each week and discuss the problems that we encounter while implementing the algorithms. Everything is organized through GitHub: we create issues, send pull requests with changes, etc.

Here you can find links to the papers that we had to read in order to implement the algorithms. There's no exam at the end and the entire course is a big project, so it mostly involved coding (and working with git).

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/impro-3/ (in German)

Humanities: Interdisciplinary Communication

Description: In this course students will learn to work on their authentic presentation and what kind of possibilities of intervention they have working in groups (knowledge about role identification and group dynamics). This workshop brings into focus multi-cultural aspects of communication.

The humanities component of the 3rd semester is quite different at TUB since it's not a language class. This course is held on weekends, and there are 3 weekends in total: one in November, one in December and one in January. At the end there's an oral exam.

Link: http://www.hi-gh.eu/index.php?page=inter-komm&hl=en_GB

Machine Learning

Description: In the lecture, introductory topics in the field of machine learning are presented. After the lecture, the learnt methods are revisited and exercises from the previous week are explained in the exercise session.

Note that the description is not from the IT4BI page, since officially this course is not part of the IT4BI program at TU Berlin. Still, it is offered to us so we can dive deeper into machine learning.

The teacher of this course is Prof. Klaus-Robert Müller, the head of the Machine Learning department at TU Berlin. He is quite famous in the Machine Learning world, as his group was one of the pioneers of kernel methods.

So, here's the content of the class:

Bayes Decision Theory
Maximum Likelihood Estimation and Bayes Learning
Principal Component Analysis
Independent Component Analysis
Fisher Discriminant Analysis
Clustering, k-means
Expectation Maximization
k-nearest Neighbor
Model Selection
Learning Theory and Kernel Methods
Support Vector Machines
Kernel Ridge Regression and Gaussian Processes
Neural Networks
Boltzmann Machines

There are no projects, and since we cannot take this course for credit, there is no point in taking the exam (and it's probably not possible either). However, every week there is a homework assignment to reinforce what was covered. The course is quite challenging, but if you're motivated, it is certainly worth the effort.

Link: https://wiki.ml.tu-berlin.de/wiki/Main/WS14_MaschinellesLernen1

There's also a second part of the course in the second semester that covers more advanced topics in Machine Learning:

Locally Linear Embedding
t-Distributed Stochastic Neighbor Embedding
Stationary Subspace Analysis
Dictionary Learning and Autoencoders
Canonical Correlation Analysis
Relevant Dimensionality Estimates
Hidden Markov Models
Kernel Methods for Structured Data
Bioinformatics, Structured Output, MKL
One-Class SVMs
Neural Networks for Structured Data

The second part reinforces the first part and dives deeper into Unsupervised Learning: the first half of the course is devoted to a broad set of Dimensionality Reduction techniques.

Database Internals & Scalable Data Processing

Lastly, the students of our generation had the opportunity to attend the lectures on database internals by Prof. Volker Markl, the head of the DIMA group. But nobody actually chose to take them, since they are mostly a repetition of the Database System Architecture class that we have at ULB. Still, it might be useful to know that this course is also offered, for those who would like to refresh their knowledge of database internals.

Link: http://www.dima.tu-berlin.de/menue/studium_und_lehre/masterstudium/idb/

7 comments:

Adetayo, 17/4/15 08:17
Hi Alexey,

Thanks for the post, it's really informative. I was wondering if any of your colleagues did the Business Intelligence as a Service specialization and has a similar write-up such as this. It would be nice to read something about it from someone who took it.

Thanks again.
Unknown, 22/4/15 14:57
Hi Adetayo, my colleagues are working on an article about this specialization
Adetayo, 22/4/15 15:56
Hello Alexey,

Thanks for the feedback. I'll be looking forward to it.
Nestor, 9/10/15 04:04
Hi Alexey,

Thanks for a great post, really helped me understand the subjects studied in the programme. I'm very interested in taking this master's. Do you know anything about the scholarships offered? As there doesn't seem to be much info on the website.
Olivia, 20/11/15 14:53
Hi Alexey,

It's a really great post, I am so excited to apply to this program :)


7 January 2015
