IT Shared: IT4BI First Year: Business Intelligence Fundamentals

The first year of the IT4BI program is devoted to Business Intelligence fundamentals. There is plenty of information on the website of the program, including the course content page and the course description document. Even though these resources are comprehensive enough, I would still like to add a few details about the curriculum, so you can compare what is written on the website with the things we studied.

Also, I have had a couple of job interviews at which the interviewers asked me what exactly I study (not everybody knows what Business Intelligence is), so here is a list I can always refer to.

All courses are presented in the same way: first, I copy the description from the IT4BI website, then list things that we actually studied - the ones I remember, then describe the course project and the exam.

Note: I am a part of the 2nd generation of IT4BI, so this applies to the academic year 2013/2014 and might not be accurate for later years.

ULB, 1st semester

The first semester is at Université Libre de Bruxelles, Brussels, Belgium. Here are the subjects we studied:

Data Warehouses
Business Process Management
Decision Engineering
Advanced Databases
Database Systems Architecture
French

Data Warehouses

Description: This course introduces the concepts and techniques necessary for designing, implementing, exploiting, and maintaining data warehouses. More precisely, this includes multidimensional databases and data warehouses, OLAP, reporting and ETL processes.

What we studied:

E/R Models
OLAP Cubes, basic operations on them (slicing, dicing, etc.)
Dimension modeling: Conceptual models for DWH
Logical modeling: Star and snowflake schemas
Slowly Changing Dimensions, accounting for changes in DWH
Performance: View materialization and Indexing
ETL: Extract, Transform, Load, populating DWH with data
Reporting
Introduction to Data Mining
We also had several invited speakers from companies like IBM and Teradata

The practical part was with the Microsoft BI Stack (MS SQL Server, SSAS, SSIS): we explored data cubes and created ETL packages.

There was no specific topic for the course project: we could do it in several areas, e.g. Indexing or Data Mining. It was also possible to take some DWH software, investigate what it can do and build a small project using it. Additionally, we could propose our own topics: in my case, I took "MapReduce for Data Warehousing".

The exam was open book and we were allowed to use computers with all course materials (no online access). It was quite short (like 3 hours) and wasn't very hard.

Course page: http://cs.ulb.ac.be/public/teaching/infoh419

Business Process Management

Description: This course introduces basic concepts for modeling and implementing business processes using contemporary information technologies and standards, such as Business Process Modeling Notation (BPMN) and Business Process Execution Languages (BPEL).

What we studied:

Petri Nets
Workflow Nets - Petri nets for workflows
Mathematical properties, soundness
YAWL, more expressive language for workflows
Workflow Patterns, e.g. deferred choice, cancellation regions, etc.
Some BPEL, although we never used it
BPMN
Process Mining: the alpha algorithm, region-based process mining and process mining based on genetic algorithms
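
To give an idea of what the Petri net part is about: a transition is enabled when each of its input places holds enough tokens, and firing it moves tokens from input places to output places. Below is a minimal Python sketch of this rule. The dictionary representation, the function names and the tiny "submit/approve" workflow are made up for illustration and are not taken from the course materials.

```python
def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n
               for place, n in transition["consume"].items())

def fire(marking, transition):
    """Return the new marking after firing an enabled transition."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, n in transition["consume"].items():
        new_marking[place] = new_marking.get(place, 0) - n
    for place, n in transition["produce"].items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking

# Hypothetical workflow net: start -> [submit] -> review -> [approve] -> end
submit = {"consume": {"start": 1}, "produce": {"review": 1}}
approve = {"consume": {"review": 1}, "produce": {"end": 1}}

initial = {"start": 1}
after_submit = fire(initial, submit)
final = fire(after_submit, approve)
```

In the initial marking only "submit" is enabled; "approve" becomes enabled once a token reaches the "review" place.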

For this course we had 3 assignments: implement a workflow with Petri nets, with YAWL (quite painful!) and with BPMN.

We could select a topic for the course project from a list of possible topics. That could be checking out some BPM software and implementing some workflows. My team did a project about process mining: we looked into the Alpha+ algorithm and the software that implements it.

For this course we had an oral exam: we got 3 questions, had some time to prepare each of them, and then had to defend each answer orally.

Course page: http://cs.ulb.ac.be/public/teaching/infoh420

Decision Engineering

Note: according to the IT4BI page the course has undergone some changes, now it's called "Applied Operational Research". Here's the description: The goal of this course is to introduce some major chapters of operational research. The main aim is to illustrate how mathematical models and specific algorithms can be used to help decision makers facing complex problems (involving a large number of alternatives, multiple criteria, uncertain or risky outcomes, multiple decision makers, etc.).

What we studied:

Voting Theory: Voting mechanisms like Plurality Voting, Borda's rule, Condorcet's Rule, etc.; some desired properties like Monotonicity, Independence of third alternatives; and important theorems like May's theorem and Arrow's Impossibility theorem
Parliamentary Allocation methods like Hamilton's and Jefferson's
Multi-Objective Optimization: Waste Utilization problem, methods like dominance and the ideal point
Multi-Criteria Decision Aid: modeling preferences, finding the best possible solution; methods like ELECTRE and PROMETHEE
Game Theory: Nash Equilibrium, Iterative Removal; Prisoner's Dilemma, Battle of the Sexes; Cournot and Bertrand Duopoly Models; the Median Voter Theorem
Decisions Under Risk and Uncertainty; Decision Trees (not the data mining ones!)
Inventory Management: EOQ Models
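
As a small illustration of the voting theory part: plurality voting and Borda's rule can pick different winners on the same ballots. The sketch below uses a made-up profile of 9 voters; the function names and the alphabetical tie-breaking are assumptions for the example, not something from the course.

```python
from collections import Counter

def plurality_winner(ballots):
    """Each ballot is a ranking, best first; only first choices count."""
    counts = Counter(ballot[0] for ballot in ballots)
    # Ties are broken alphabetically (an arbitrary choice for this sketch).
    return max(sorted(counts), key=lambda c: counts[c])

def borda_scores(ballots):
    """With m candidates, position i (0-based) on a ballot is worth m - 1 - i points."""
    scores = Counter()
    for ballot in ballots:
        m = len(ballot)
        for position, candidate in enumerate(ballot):
            scores[candidate] += m - 1 - position
    return dict(scores)

def borda_winner(ballots):
    scores = borda_scores(ballots)
    return max(sorted(scores), key=lambda c: scores[c])

# Hypothetical profile: 4 voters a > b > c, 3 voters b > c > a, 2 voters c > b > a
ballots = [["a", "b", "c"]] * 4 + [["b", "c", "a"]] * 3 + [["c", "b", "a"]] * 2

print(plurality_winner(ballots))  # a (4 first places against 3 and 2)
print(borda_scores(ballots))      # a: 8, b: 12, c: 7
print(borda_winner(ballots))      # b
```

Candidate a wins under plurality with only 4 of 9 first places, while b wins under Borda's rule because nobody ranks b last.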

For practice we solved problems to reinforce the theory, and the course project was about applying the PROMETHEE method using special software.

The exam was closed-book, with no electronic devices, and lasted 3 hours. We had exercises on Voting Theory, on MCDA (proving some properties), on Game Theory and on Decisions Under Risk (for this we had to build a decision tree).

For me it was the most interesting course in the 1st semester, but I guess not many of my colleagues will agree with me :)

No public course page, but you can try http://uv.ulb.ac.be/ with visiteur/visiteur (user/password), look for course "MATH-H-405".

Advanced Databases

Description: Today, databases are moving away from typical management applications, and address new application areas. For this, databases must consider (1) recent developments in computer technology, as the object paradigm and distribution, and (2) management of new data types such as spatial or temporal data. This course introduces the concepts and techniques of some innovative database applications.

What we studied:

Active databases: triggers and the like. Practice with MS SQL Server.
Temporal databases: queries on temporal data, e.g. temporal joins. Practice with MS SQL Server
Object-Oriented and Object-Relational Databases: some boring and already dead standards on object databases. Practice with LINQ (C#) and Oracle
Spatial Databases: spatial queries, practice in PostGIS (PostgreSQL)

The course project was about taking any DBMS (most people chose some NoSQL database) and describing what it can do and why it can be useful.

The exam was open-book, but no electronic devices allowed. It lasted for about 5 hours (extremely long), and for me it was the most difficult exam of the 1st semester.

Course page: http://cs.ulb.ac.be/public/teaching/infoh415

Database Systems Architecture

Description: In contrast to a typical introductory course in database systems where one learns to design and query relational databases, the goal of this course is to get a fundamental insight into the implementation aspects of database systems. In particular, we take a look under the hood of relational database management systems, with a focus on query and transaction processing. By having an in-depth understanding of the query-optimisation-and-execution pipeline, one becomes more proficient in administering DBMSs, and hand-optimising SQL queries for fast execution.

What we studied:

Query Processing Pipeline
Logical query plan: Relational Algebra, Translating SQL to relational algebra
Plan optimization: Conjunctive Queries for removing redundant joins; heuristics (like pushing projections)
Physical plan: Operators (joins: nested-loop joins, hash-join, sort-merge joins; union, intersection, difference)
Physical plan optimization: query size estimation, greedy algorithm for join ordering
Indexes: Dense, Sparse; B-Trees, Open-hashing indexes, extensible hashing, linear hashing
Multi-dimensional indexes: kd-trees, Quad trees, R-Trees, grid file index
Ensuring ACID: Crash recovery, database transaction logs (undo/redo logging), concurrency control, schedulers (lock-based, timestamp-based)

For the course project we had to implement the External Multi-Way Merge Sort algorithm and evaluate its performance under different settings. The implementation was quite easy, but the evaluation, on the other hand, was quite time-consuming.
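
For readers who have not seen the algorithm, here is a minimal Python sketch of the general idea: cut the input into sorted runs that fit in memory, write them to disk, then merge the runs. It is not the project's implementation. The file format (one integer per line), the function name and the `run_size` parameter are assumptions, and it merges all runs in a single pass, whereas an implementation with a limited number of buffers would merge in several passes.

```python
import heapq
import os
import tempfile
from itertools import islice

def external_merge_sort(input_path, output_path, run_size=1000):
    """Sort a text file with one integer per line; returns the number of runs."""
    run_paths = []
    # Phase 1: cut the input into sorted runs of at most run_size values each.
    with open(input_path) as src:
        while True:
            chunk = [int(line) for line in islice(src, run_size)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{value}\n" for value in chunk)
            run_paths.append(path)
    # Phase 2: k-way merge of all runs, reading them lazily line by line.
    run_files = [open(path) for path in run_paths]
    try:
        streams = [(int(line) for line in f) for f in run_files]
        with open(output_path, "w") as out:
            for value in heapq.merge(*streams):
                out.write(f"{value}\n")
    finally:
        for f in run_files:
            f.close()
        for path in run_paths:
            os.remove(path)
    return len(run_paths)
```

Only one run is held in memory during the first phase, and only one value per run during the merge. The "different settings" to evaluate would then be things like the run size and the number of runs merged at once.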

We had two types of exercises: pen-and-paper most of the time and we also had a couple of labs where we were given a database implementation and had to change some code there (e.g. see how a B-Tree works and add some missing pieces of code there).

The exam was closed-book, and mostly had the same content as our pen-and-paper exercises. Additionally, we were asked to describe how an R-Tree works.

Course page: http://cs.ulb.ac.be/public/teaching/infoh417

French

We also studied the French language. There were 3 groups, for A0, A1 and A2 levels.

UFRT, 2nd semester

The second semester was at Université François Rabelais, Blois, France. Unlike ULB, there are no public course web pages with information.

Here's the list of courses we had:

Advanced Data Warehousing
Knowledge Discovery and Data Mining
XML and Web Technologies
Information Retrieval
Business Intelligence Seminar
French, German or Spanish

All the exams at UFRT were 2 hours long and were quite easy (compared to ULB).

Advanced Data Warehousing

Description: The aim of this course is to complement the course Data Warehouses (Semester 1) in its study of database technology used in Business Intelligence. A particular focus is given on the problems posed by heterogeneous data integration and data quality on the one hand, and on leveraging OLAP workload on the other hand. Classical notions of data warehousing and OLAP are recalled and developed: architecture, ETL, conceptual and logical design, query processing and optimization. Advanced topics like query personalization and recommendation are introduced.

What we studied:

basically the same things as for Data Warehouses at ULB
additional topics: Data Quality, MDX queries

The practical part was different though: in the labs we used Talend Data Quality and Talend Open Studio, Pentaho Mondrian on top of MySQL and MonetDB.

The course project was also different: in groups of 4/5 people we did a project on the FIFA World Cup. We had to build conceptual and logical schemas, find data (and crawl it from webpages), assess its quality, do ETL and then build reports. So it was more practical than the one at ULB, and I feel we learned more practical stuff here, but less theory than at ULB.

The exam was quite easy, and as far as I remember some exercises were based on the project.

Knowledge Discovery and Data Mining

Description: This course gives students a detailed understanding of the strengths and limitations of popular data mining techniques. It also allows students to understand the problems associated with the computational complexity issues in data mining.

What we learned:

Introduction: CRISP-DM process, Univariate/Bivariate analysis, data cleaning and transforming, sampling
Local Pattern Discovery: frequent patterns, association rules, Apriori, Eclat
Mining sequential patterns: Apriori for sequences
Decision Trees: splitting using information gain, the issue of overfitting, pruning
Model evaluation: metrics for ranking models (Spearman's and Kendall's correlation), k-fold cross validation, ROC curves, Gain charts
Clustering: k-means, hierarchical clustering, DBSCAN
Perceptron, Logistic Regression, Neural Networks
Associative Rule based methods
Instance based methods: KNN (K nearest neighbors)
Naive Bayes and Introduction to graphical models (d-separation)

We had two types of practical sessions: one was pen-and-paper exercises on each topic and another was on the computer using IBM SPSS Modeler.

The course project was a lot of fun: it was about Link Prediction in social networks. We were given a social network graph, extracted some features from it and trained a classifier to predict if two nodes have a link (i.e. if two users of a social network know each other).
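
A minimal sketch of what feature extraction for a pair of nodes can look like is shown below. The three features (common neighbors, Jaccard coefficient, preferential attachment) are standard ones for link prediction, but they are an assumption here and not necessarily the ones used in the project; the toy graph and the function names are made up as well.

```python
def build_adjacency(edges):
    """Turn an undirected edge list into a dict: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def pair_features(adj, u, v):
    """Neighborhood-based features for the candidate link (u, v)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "preferential_attachment": len(nu) * len(nv),
    }

# Hypothetical toy social network
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("bob", "dan"), ("cat", "dan"), ("dan", "eve")]
adj = build_adjacency(edges)

likely = pair_features(adj, "ann", "dan")    # two shared friends
unlikely = pair_features(adj, "ann", "eve")  # no shared friends
```

Feature vectors like these, computed for pairs that are linked and pairs that are not, can then be fed to any binary classifier.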

The exam was based on the pen-and-paper exercises that we had (on decision trees, frequent patterns, clustering and Naive Bayes). It was kind of open-book (we were allowed to bring only 2 pages of hand-written text), and no electronic devices were allowed, not even a calculator.

XML and Web Technologies

Description: The advent of the World Wide Web has given rise to multiple technologies and techniques for exchanging data on the Web. This course studies these technologies for understanding the theory underlying these technologies but also to understand in what scenarios a certain technology is applicable.

What we studied:

Introduction: semi-structured data model, trees, XML, namespaces
DTD and XSD schemas
Tree Automata for schema validation
XPath
XSLT
Semantic web: RDF, RDFS, RDFS Plus, OWL
XQuery
Integrity constraints and how to validate them

The course was quite practical (except for two lectures on validation), and the computer exercises were done using oXygen and Protégé. We had no course project for this course.

As for the exam, it was open-book, but we were allowed to use only slides. We had a couple of exercises on tree automata and then on schema, XPath, XSLT and RDF/RDFS. It wasn't hard except for one thing: we had to write everything by hand (completely pointless in my opinion).

Information Retrieval

Description: This course studies the processing, indexing, querying, organization and classification of textual documents. It also gives the foundations of natural language processing and its use in information retrieval.

What we studied:

Introduction, indexing, inverted index
Indexing processing: tokenization, stop words removal, stemming
Vector space models: bag-of-words, term frequency, TF-IDF weighting, vector space similarity
Other methods: Boolean model, probabilistic ranking
Quality metrics: precision, recall, F-score
User-based ranking: personalization
Recommender systems: content-based, collaborative filtering
Markov ranking: PageRank, Hubs and Authorities
NLP: spelling correction, edit distance
NLP: morphology, part of speech tagging, named entities
NLP: syntax, parse trees, Chomsky's hierarchy

We again had two types of exercises: pen-and-paper (calculating TF-IDF, recommendations, PageRank) and lab sessions (indexing with Lucene, some labs with Stanford NLP, plus a tool for parsing text data to produce parse trees).
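
The TF-IDF calculation from the pen-and-paper exercises can be sketched in a few lines of Python. There are several TF-IDF variants; this sketch assumes raw term counts for TF and log10(N / df) for IDF, which may differ from the exact formula used in the course. The three-document corpus and the function name are made up.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = raw term count * log10(N / document frequency).
    """
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus
corpus = ["data warehouse design", "data mining", "process mining process"]
weights = tf_idf(corpus)
```

In the third document "process" gets the highest weight: it occurs twice there and in no other document, while "mining" is shared with the second document and is weighted down.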

The project was more of a software engineering project than an IR one. The task was to build a system for indexing the FIFA World Cup data warehouse as well as documents that users can upload. Apart from being indexed, the documents had to be retrieved in a personalized way, and the system had to be able to make recommendations. We used existing libraries (in our case Lucene, Stanford NLP, Mahout) and just had to put them together.

On the exam we had to calculate TF-IDF for a small corpus, create parse trees and compute PageRank.
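
For reference, PageRank on a small graph can be computed with simple power iteration. The sketch below assumes a damping factor of 0.85 and spreads the rank of pages without outgoing links evenly over all pages; the three-page graph, the function name and the parameter values are made up and are not the exam task.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical web graph: a links to b and c, b links to c, c links back to a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Page c ends up with the highest rank because both other pages link to it, and a comes second because it receives all of c's rank.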

Business Intelligence Seminar

Description: This course presents current trends and recent developments in the domain of BI. It is designed and jointly taught by all consortium partners (main and associated) and will involve guest speakers presenting their organization, the three specializations, research topics, internships, and Masters’ thesis subjects for the second year of the master.

We could select from one of the following topics:

Query recommendation and optimization
Measuring the quality of database queries
Social Network Based Recommendation with Temporal Dynamics
Leveraging parsing, multi-word expressions and multilingualism for information retrieval
Trajectory Data mining
Semantic technologies in real life
Among others

The whole course is a course project: in groups of 4/5 people we selected one of the topics, got a supervisor with knowledge about the topic, did a lot of reading about it and at the end presented everything we had learned. There was no exam for this course.

Humanities

Depending on our choice of specialization, we studied French, German or Spanish. But it was also possible to take another language (say, German, even though you were going to Spain) or take two of them at the same time.

Second Year

That was our first year. For the second year we have a specialization at École Centrale Paris, Technische Universität Berlin, or Universitat Politècnica de Catalunya (Barcelona). The specialization of my choice was the one at TU Berlin and I describe it in the post IT4BI: Distributed and Large-Scale Business Intelligence.

	Ahmet Anıl Pala
	Alexey Grigorev
	Andrés Vivanco Villamar
	Andres Felipe Zamora Montaño
	Elena Samota
	Guven Toprakkiran
	Hicham Akaoka Badssi
	José Luis Pino López
	Madalina Burghelea
	Maximiliano Ariel López
	Mia Johnson Vioulès
	Navid Mahlouji
	Nyami Ronald Mitterand
	Steffi Melinda
	Stephany García Martínez
	Tamara Mendt

26 December 2014