7 January 2015

IT4BI: Distributed and Large-Scale Business Intelligence


The first semester of the second year is devoted to specialization, and there are 3 possible choices in the IT4BI program. One of them is "Distributed and Large-Scale Business Intelligence" and it is delivered by the DIMA group at the Technical University of Berlin. It is the specialization of my choice, so I'm happy to share my experience about it in this post.

This is the second post about the IT4BI curriculum. If you haven't read the first one yet, start here: IT4BI First Year: Business Intelligence Fundamentals.

I'll present the courses in the same way as in the previous post: first, I'll copy the description from the IT4BI website, then I'll list things we study, and then add a line or two about projects and exams.

Disclaimer: I am part of the 2nd generation of IT4BI, so this post applies to the academic year 2014/2015 and might not be accurate for later years.

TUB, 3rd semester

So the following classes are offered at TU Berlin:
  • Big Data Analytics Seminar
  • Implementation of a Database Engine
  • Heterogeneous and Distributed Information Systems
  • Large Scale Data Analysis and Data Mining
  • Big Data Analytics Projects
  • Humanities: Interdisciplinary Communication
  • Machine Learning (Optional)

At TU Berlin, IT4BI students choose between Heterogeneous and Distributed Information Systems and Large Scale Data Analysis and Data Mining. Additionally, it's possible to attend Machine Learning classes, but without credit.

Big Data Analytics Seminar

Description: Participants of this seminar will acquire knowledge about recent research results and trends in the analysis of web-scale data. Through the work in this seminar, students will learn the comprehensive preparation and presentation of a research topic in this field. In order to achieve this, students will get to read and categorize a scientific paper, conduct background literature research and present as well as discuss their findings.

Each of us is assigned a paper and a mentor. We have to work through the paper, analyze its strengths and weaknesses, and then deliver two presentations about it as well as a written report.

Some of the papers from 2014:
  • Summingbird: A Framework for Integrating Batch and Online MapReduce Computations (pdf)
  • SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets (pdf)
  • SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures (pdf)
  • Fast Personalized PageRank on MapReduce (pdf)
  • Full list

Some of the papers from 2013:
  • Upper and lower bounds on the cost of a map-reduce computation (pdf)
  • Making queries tractable on big data with preprocessing (pdf)
  • Jet: An Embedded DSL for High Performance Big Data Processing (pdf)
  • Only Aggressive Elephants are Fast Elephants (pdf)
  • Full list

Note that the papers were quite different in 2013 and 2014, so there's a high chance that the list will change in the coming years. Still, it should give you a rough idea of what kind of papers are studied during this course.

There is no exam for this course and no separate course project, just a presentation about the paper at the end of the course.

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/imsem/ (in German)

Implementation of a Database Engine

Description: In this lab course students will learn how to implement components of a query processor with focus on complex queries as they occur in data warehouses and OLAP. Students will create a working SQL query processor that can answer a set of basic analytical queries.

This course is about implementing the things we learned during our Database Systems Architecture classes at ULB in our first semester (you can read more about it here).

The things we have to implement:
  • Basic IO
  • ARC Cache
  • Buffer Pool manager
  • Indexing, B-Tree
  • Operators (Table scan, Index scan; Filter, Nested loop join, Merge sort join, Group by)
  • Optimizers (Join order, access path selector)
  • Map-Reduce

Almost every week we get a new assignment, implement it in Java and submit it to an automated test system that awards up to 10 points. We don't have to implement everything from scratch: we are given some skeleton code, so we need to implement certain interfaces and call certain API methods. Still, it is quite challenging and time-consuming.

There's no exam for this course, and the entire course is a course project.

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/idb-pra/

Heterogeneous and Distributed Information Systems

Description: In this course students will gain conceptual, methodological and practical knowledge about the development and integration of modern distributed, heterogeneous information systems based on the concepts of model integration, data integration, promotion of information systems and metadata management for Business Intelligence.

Note: I selected the Large Scale Data Analysis course instead of this, so the information might not be 100% accurate.

The topics covered in the class:
  • Foundations/Terminology of HDIS (FDBS, FIS, MBIS)
  • Dimensions of HDIS: Distribution, Heterogeneity, Autonomy
  • Heterogeneous Data Models in HDIS: structured, semistructured, unstructured
  • Distributed Data Organisation and Software Architectures of HDIS (FIS, P2P, CS, ...)
  • Interoperability and Middleware Platforms for HDIS
  • Persistency Services
  • Metadata Standards and Management in HDIS
  • Model-based Development of HDIS
  • Applications from Industry and Public Services

The class is run as a seminar: students are assigned different topics, which they study on their own and then present to the others. The topics that nobody chooses are presented by the teacher. There is a course project, and this semester it was about designing a web-based system for choosing a movie that also considers the best way to reach the cinema and suggests a good restaurant nearby to go to afterwards. There is no exam at the end.

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/aim-1/ (in German)

Large Scale Data Analysis and Data Mining

Description: The focus of this module is to get familiar with different parallel processing platforms and paradigms and to understand their feasibility for different kinds of data mining problems. For that students will learn how to adapt popular data mining and standard machine learning algorithms such as: Naive Bayes, K-Means clustering, PageRank, Alternating Least Squares, or other methods of text mining, graph analysis, or recommender systems to scalable processing paradigms. Students will subsequently gain practical experience in how to analyse Big Data on parallel processing platforms such as the Apache Big Data Stack (Hadoop, Giraph, etc.), the Berkeley Big Data Analytics Stack (e.g., Spark) and Apache Flink.

Topics covered:
  • Hadoop MapReduce
  • Joins in MapReduce
  • Apache Spark, Apache Flink
  • Math refresher: Linear Algebra, some Probability and Statistics
  • Classification: Naive Bayes, Ensembles, Logistic Regression, SVM
  • Stochastic Gradient Descent
  • Clustering: K-Means in MapReduce, BFR, CURE
  • Recommender Systems
  • Dimensionality Reduction, SVD
  • Graph Mining, PageRank, Pregel
  • Large Scale Statistical Natural Language Processing
  • Online Learning / Stream Processing
  • Privacy and Legal Issues
  • Visualization Analytics

All topics are discussed in the context of processing data on a large scale. This course is not a replacement for the Data Mining course that we had at UFRT during our second semester, and it's different from the Machine Learning classes at TUB, because it deals with scalable algorithms that aren't covered in other IT4BI courses.

We had 4 assignments during the course:
  • Hadoop assignment
  • Apache Flink assignment
  • Math refresher, implementing matrix multiplication in Flink
  • implementing Naive Bayes in Flink for text classification

We also had a project on a related topic of our choice, which we had to defend by presenting a poster, and there was an oral exam at the end. The exam covered the project and some of the topics we studied.

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/aim-3/ (in German)
GitHub: https://github.com/sscdotopen/aim3

Big Data Analytics Projects

Description: In this course students will learn to systematically analyze a current issue in the information management area and to develop and implement a problem-oriented solution as part of a team.

During this course we had to choose a paper on graph processing and implement it in groups in Apache Flink using the Scala programming language.

We could select papers on the following topics:
  • Structural diversity measured through number of (strongly or weakly) connected components
  • User engagement measured through k-core decomposition
  • Importance of a node measured through centrality
  • Spatio-Temporal Dynamics of Online Memes
  • Social Contagion
  • Minimum Spanning Forest and Single Source Shortest Path
  • Graph Coloring
  • Approximate Maximum Weight Matching

We meet each week and discuss the problems that we encounter while implementing the algorithm. Everything is organized through GitHub: we create issues, send pull requests with changes, etc.

Here you can find links to the papers that we had to read in order to implement the algorithms. There's no exam at the end and the entire course is one big project, so it mostly involves coding (and working with git).

Link: http://www.dima.tu-berlin.de/menue/teaching/masterstudium/impro-3/ (in German)

Humanities: Interdisciplinary Communication

Description: In this course students will learn to work on their authentic presentation and what kind of possibilities of intervention they have working in groups (knowledge about role identification and group dynamics). This workshop brings into focus multi-cultural aspects of communication.

The humanities component of the 3rd semester is quite different at TUB since it's not a language class. This course is held on weekends, and there are 3 weekends in total: one in November, one in December and one in January. At the end there's an oral exam.

Link: http://www.hi-gh.eu/index.php?page=inter-komm&hl=en_GB

Machine Learning

Description: In the lecture, introductory topics in the field of machine learning are presented. After the lecture, the learnt methods are revisited and exercises from the previous week are explained in the exercise session.

Note that the description is not from the IT4BI page, since officially this course is not part of the IT4BI program at TU Berlin. Still, it is offered to us so that we can dive deeper into machine learning.

The teacher of this course is Prof. Klaus-Robert Müller, the head of the Machine Learning department at TU Berlin. He is quite famous in the Machine Learning world, as his group was one of the pioneers of kernel methods.

So, here's the content of the class:
  • Bayes Decision Theory
  • Maximum Likelihood Estimation and Bayes Learning
  • Principal Component Analysis
  • Independent Component Analysis
  • Fisher Discriminant Analysis
  • Clustering, k-means
  • Expectation Maximization
  • k-nearest Neighbor
  • Model Selection
  • Learning Theory and Kernel Methods
  • Support Vector Machines
  • Kernel Ridge Regression and Gaussian Processes
  • Neural Networks
  • Boltzmann Machines

There are no projects, and since we cannot take this course for credit, there is no point in taking the exam (and it's probably not possible either). However, every week there is a homework assignment to reinforce what was covered. The course is quite challenging, but if you're motivated, it is certainly worth the effort.

Link: https://wiki.ml.tu-berlin.de/wiki/Main/WS14_MaschinellesLernen1

There's also a second part of the course in the second semester that studies more advanced topics in Machine Learning:
  • Locally Linear Embedding
  • t-Distributed Stochastic Neighbor Embedding
  • Stationary Subspace Analysis
  • Dictionary Learning and Autoencoders
  • Canonical Correlation Analysis
  • Relevant Dimensionality Estimates
  • Hidden Markov Models
  • Kernel Methods for Structured Data
  • Bioinformatics, Structured Output, MKL
  • One-Class SVMs
  • Neural Networks for Structured Data

The second part reinforces the first part and dives deeper into Unsupervised Learning: the first half of the course is devoted to a broad set of Dimensionality Reduction techniques.

Database Internals & Scalable Data Processing

Lastly, the students of our generation had the opportunity to attend the lectures on database internals by Prof. Volker Markl, the head of the DIMA group. Nobody actually chose to take it, since it's mostly a repetition of the Database Systems Architecture class that we have at ULB. Still, it might be useful to know that it's also offered, for those who would like to refresh their knowledge of database internals.

Link: http://www.dima.tu-berlin.de/menue/studium_und_lehre/masterstudium/idb/

7 comments:

  1. Hi Alexey,

    Thanks for the post, it's really informative. I was wondering if any of your colleagues did the Business Intelligence as a Service specialization and has a similar write-up such as this. It will be nice to read something about it from someone who took it.

    Thanks again.

  2. Hi Adetayo, my colleagues are working on an article about this specialization

  3. Hello Alexey,

    Thanks for the feedback. I'll be looking forward to it.

  4. Hi Alexey,

    Thanks for a great post, really helped me understand the subjects studied in the programme. I'm very interested in taking this master's. Do you know anything about the scholarships offered? As there doesn't seem to be much info on the website.

    Replies
    1. Check the page "Tuition Fees & Scholarships" http://it4bi.univ-tours.fr/home/students/tuition-fees-scholarships/

  5. Hi Alexey,

    It's a really great post. I am so excited to apply to this program :)
