26 December 2014

IT4BI First Year: Business Intelligence Fundamentals

The first year of the IT4BI program is devoted to Business Intelligence fundamentals. There is plenty of information on the website of the program, including the course content page and the course description document. Even though these resources are comprehensive enough, I would still like to add a few details about the curriculum, so you can compare what is written on the website with the things we studied.

Also, I have had a couple of job interviews at which the interviewers asked me what exactly I study (not everybody knows what Business Intelligence is), so here is a list I can always refer to.

All courses are presented in the same way: first I copy the description from the IT4BI website, then list the things we actually studied (the ones I remember), and then describe the course project and the exam.

Note: I am a part of the 2nd generation of IT4BI, so this applies to the academic year 2013/2014 and might not be accurate for later years.

ULB, 1st semester

The first semester is at Université Libre de Bruxelles, Brussels, Belgium. Here are the subjects we studied:
  • Data Warehouses
  • Business Process Management
  • Decision Engineering
  • Advanced Databases
  • Database Systems Architecture
  • French

    Data Warehouses

    Description: This course introduces the concepts and techniques necessary for designing, implementing, exploiting, and maintaining data warehouses. More precisely, this includes multidimensional databases and data warehouses, OLAP, reporting and ETL processes.

    What we studied:
    • E/R Models
    • OLAP Cubes, basic operations on them (slicing, dicing, etc.)
    • Dimensional modeling: Conceptual models for DWH
    • Logical modeling: Star and snowflake schemas
    • Slowly Changing Dimensions, accounting for changes in DWH
    • Performance: View materialization and Indexing
    • ETL: Extract, Transform, Load, populating DWH with data
    • Reporting
    • Introduction to Data Mining
    • We also had several invited speakers from companies like IBM and Teradata
    The practical part was with the Microsoft BI Stack (MS SQL Server, SSAS, SSIS): we explored data cubes and created ETL packages.
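
The basic cube operations are easy to get a feel for outside SSAS; here is a rough sketch with pandas on a made-up sales table (the column names and data are invented for illustration):

```python
import pandas as pd

# Hypothetical sales fact table with three dimensions and one measure
sales = pd.DataFrame({
    "year":    [2013, 2013, 2014, 2014, 2014],
    "region":  ["EU", "US", "EU", "US", "EU"],
    "product": ["A", "A", "B", "B", "A"],
    "amount":  [100, 150, 200, 250, 120],
})

# A small "cube": aggregate the measure over two dimensions
cube = sales.pivot_table(values="amount", index="year",
                         columns="region", aggfunc="sum")

# Slice: fix one dimension to a single value
slice_2014 = sales[sales["year"] == 2014]

# Dice: restrict several dimensions to subsets
dice = sales[(sales["year"] == 2014) & (sales["region"] == "EU")]

print(cube)
print(dice["amount"].sum())  # 320
```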

    There was no specific topic for the course project; we could do it in several areas, e.g. indexing or data mining. It was also possible to take some DWH software, investigate what it can do, and build a small project with it. Additionally, we could propose our own topics: in my case, I took "MapReduce for Data Warehousing".

    The exam was open book and we were allowed to use computers with all course materials (no online access). It was quite short (like 3 hours) and wasn't very hard.

    Course page: http://cs.ulb.ac.be/public/teaching/infoh419 

    Business Process Management

    Description: This course introduces basic concepts for modeling and implementing business processes using contemporary information technologies and standards, such as Business Process Modeling Notation (BPMN) and Business Process Execution Languages (BPEL).

    What we studied:
    • Petri Nets
    • Workflow Nets - Petri nets for workflows
    • Mathematical properties, soundness
    • YAWL, more expressive language for workflows
    • Workflow Patterns, e.g. deferred choice, cancellation regions, etc.
    • some BPEL, although we never used it
    • BPMN
    • Process Mining: the alpha algorithm, region-based process mining and process mining based on genetic algorithms
    For this course we had 3 assignments: implement a workflow with Petri Nets, with YAWL (quite painful!) and with BPMN.

    We could select a topic for the course project from a list of possible topics. That could be checking out some BPM software and implementing some workflows. My team did a project about process mining: we looked into the Alpha+ algorithm and the software that implements it.

    For this course we had an oral exam: there were 3 questions, some time to prepare for each, and we had to defend each answer orally.

    Course page: http://cs.ulb.ac.be/public/teaching/infoh420 

    Decision Engineering

    Note: according to the IT4BI page the course has undergone some changes, now it's called "Applied Operational Research". Here's the description: The goal of this course is to introduce some major chapters of operational research. The main aim is to illustrate how mathematical models and specific algorithms can be used to help decision makers facing complex problems (involving a large number of alternatives, multiple criteria, uncertain or risky outcomes, multiple decision makers, etc.).

    What we studied:
    • Voting Theory: voting mechanisms like Plurality Voting, Borda's rule, Condorcet's rule, etc.; some desired properties like Monotonicity and Independence of Irrelevant Alternatives; and important theorems like May's theorem and Arrow's Impossibility Theorem
    • Parliamentary allocation methods like Hamilton's and Jefferson's
    • Multi-Objective Optimization: Waste Utilization problem, methods like dominance and the ideal point
    • Multi-Criteria Decision Aid: modeling preferences, finding the best possible solution; methods like ELECTRE and PROMETHEE
    • Game Theory: Nash Equilibrium, Iterative Removal; Prisoner's Dilemma, Battle of the Sexes; Cournot and Bertrand Duopoly Models; the Median Voter Theorem
    • Decisions Under Risk and Uncertainty; Decision Trees (not the data mining ones!)
    • Inventory Management: EOQ Models
    For practice we solved problems for reinforcing the theory, and the course project was about using PROMETHEE method on special software.
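
To give one concrete example of the voting mechanisms, here is a minimal Borda count in Python (the ballots below are made up):

```python
from collections import defaultdict

def borda(ballots):
    """Borda count: each voter ranks all n candidates; a candidate gets
    n-1 points for a first place, n-2 for second, ..., 0 for last."""
    scores = defaultdict(int)
    for ranking in ballots:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)

# Three voters, each ranking candidates best-first
ballots = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "C", "A"],
]
print(borda(ballots))  # {'A': 4, 'B': 3, 'C': 2}
```

Here A wins under both plurality and Borda, but with other ballots the two rules can disagree, which is exactly what the desired-properties part of the course was about.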

    The exam was closed-book, no electronic devices, 3 hours long. We had exercises on voting theory, MCDA (proving some properties), Game Theory, and Decision Under Risk (for which we had to build a decision tree).

    For me it was the most interesting course in the 1st semester, but I guess not many of my colleagues will agree with me :)

    No public course page, but you can try http://uv.ulb.ac.be/ with visiteur/visiteur (user/password), look for course "MATH-H-405".

    Advanced Databases

    Description: Today, databases are moving away from typical management applications, and address new application areas. For this, databases must consider (1) recent developments in computer technology, as the object paradigm and distribution, and (2) management of new data types such as spatial or temporal data. This course introduces the concepts and techniques of some innovative database applications.

    What we studied:
    • Active databases: triggers and the like. Practice with MS SQL Server.
    • Temporal databases: queries on temporal data, e.g. temporal joins. Practice with MS SQL Server
    • Object-Oriented and Object-Relational Databases: some boring and long-dead standards for object databases. Practice with LINQ (C#) and Oracle
    • Spatial Databases: spatial queries, practice in PostGIS (PostgreSQL)
    The course project was about taking any DBMS (most people chose some NoSQL database) and describing what it can do and why it can be useful.

    The exam was open-book, but no electronic devices allowed. It lasted for about 5 hours (extremely long), and for me it was the most difficult exam of the 1st semester.

    Course page: http://cs.ulb.ac.be/public/teaching/infoh415

    Database Systems Architecture

    Description: In contrast to a typical introductory course in database systems where one learns to design and query relational databases, the goal of this course is to get a fundamental insight into the implementation aspects of database systems. In particular, we take a look under the hood of relational database management systems, with a focus on query and transaction processing. By having an in-depth understanding of the query-optimisation-and-execution pipeline, one becomes more proficient in administering DBMSs, and hand-optimising SQL queries for fast execution.

    What we studied:
    • Query Processing Pipeline
    • Logical query plan: Relational Algebra, Translating SQL to relational algebra
    • Plan optimization: Conjunctive Queries for removing redundant joins; heuristics (like pushing projections)
    • Physical plan: Operators (joins: nested-loop joins, hash-join, sort-merge joins; union, intersection, difference)
    • Physical plan optimization: query size estimation, greedy algorithm for join ordering
    • Indexes: Dense, Sparse; B-Trees, Open-hashing indexes, extensible hashing, linear hashing
    • Multi-dimensional indexes: kd-trees, Quad trees, R-Trees, grid file index
    • Ensuring ACID: Crash recovery, database transaction logs (undo/redo logging), concurrency control, schedulers (lock-based, timestamp-based)
    For the course project we had to implement the External Multi-Way Merge Sort algorithm and evaluate its performance under different settings. The implementation was quite easy, but the evaluation was quite time consuming.
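
For the curious, the core of the algorithm fits in a few lines; below is a toy Python sketch (the real project had real I/O buffer management and performance measurement, which this skips):

```python
import heapq
import os
import tempfile

def external_sort(values, chunk_size=4):
    """Toy external multi-way merge sort: sort fixed-size chunks
    (chunk_size models the memory limit), spill each run to a temp
    file, then k-way merge the runs with a min-heap."""
    run_files = []
    # Phase 1: create sorted runs on disk
    for i in range(0, len(values), chunk_size):
        chunk = sorted(values[i:i + chunk_size])
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(map(str, chunk)))
        f.seek(0)
        run_files.append(f)
    # Phase 2: k-way merge; heapq.merge keeps one element per run in memory
    runs = [(int(line) for line in f) for f in run_files]
    merged = list(heapq.merge(*runs))
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return merged

print(external_sort([9, 1, 7, 3, 8, 2, 6, 5, 4]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```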

    We had two types of exercises: pen-and-paper most of the time, plus a couple of labs where we were given a database implementation and had to change some code there (e.g. see how a B-Tree works and add some missing pieces of code).

    The exam was closed book, and mostly had the same content as our pen-and-paper exercises. Additionally we were asked to describe how an R-Tree works.

    Course page: http://cs.ulb.ac.be/public/teaching/infoh417


      We also studied the French language. There were 3 groups, for A0, A1 and A2 levels.

      UFRT, 2nd semester

      The second semester was at Université François Rabelais in Blois, France. Unlike ULB, there are no public course web pages with information.

      Here's the list of courses we had:
      • Advanced Data Warehousing
      • Knowledge Discovery and Data Mining
      • XML and Web Technologies
      • Information Retrieval
      • Business Intelligence Seminar
      • French, German or Spanish
      All the exams at UFRT were 2 hours long and were quite easy (compared to ULB).

      Advanced Data Warehousing

      Description: The aim of this course is to complement the course Data Warehouses (Semester 1) in its study of database technology used in Business Intelligence. A particular focus is given on the problems posed by heterogeneous data integration and data quality on the one hand, and on leveraging OLAP workload on the other hand. Classical notions of data warehousing and OLAP are recalled and developed: architecture, ETL, conceptual and logical design, query processing and optimization. Advanced topics like query personalization and recommendation are introduced.

      What we studied:
      • basically the same things as for Data Warehouses at ULB
      • additional topics: Data Quality, MDX queries
      The practical part was different though: in the labs we used Talend Data Quality and Talend Open Studio, Pentaho Mondrian on top of MySQL and MonetDB.

      The course project was also different: in groups of 4-5 people we did a project on the FIFA World Cup. We had to build conceptual and logical schemas, find data (and crawl it from web pages), assess its quality, do the ETL, and then build reports. So it was more practical than the one at ULB; I feel we learned more practical skills here, but less theory.

      The exam was quite easy, and as far as I remember some exercises were based on the project.

      Knowledge Discovery and Data Mining

      Description: This course gives students a detailed understanding of the strengths and limitations of popular data mining techniques. It also allows students to understand the problems associated with the computational complexity issues in data mining.

      What we learned:
      • Introduction: CRISP-DM process, Univariate/Bivariate analysis, data cleaning and transforming, sampling
      • Local Pattern Discovery: frequent patterns, association rules, Apriori, Eclat
      • Mining sequential patterns: Apriori for sequences
      • Decision Trees: splitting using information gain, the issue of overfitting, pruning
      • Model evaluation: metrics for ranking models (Spearman's and Kendall's correlation), k-fold cross-validation, ROC curves, Gain charts
      • Clustering: k-means, hierarchical clustering, DBSCAN
      • Perceptron, Logistic Regression, Neural Networks
      • Associative Rule based methods
      • Instance based methods: KNN (K nearest neighbors)
      • Naive Bayes and Introduction to graphical models (d-separation)
      We had two types of practical sessions: pen-and-paper exercises on each topic, and computer sessions using IBM SPSS Modeler.
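
As an illustration of the local pattern discovery part, here is a toy Apriori in Python on invented shopping-basket data (a real implementation would be much more careful about candidate generation):

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Toy Apriori: find all itemsets appearing in at least
    min_support transactions, growing candidates level by level."""
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    singletons = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in singletons if support(s) >= min_support}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        # Join k-itemsets into (k+1)-candidates, then prune: keep only
        # candidates whose every k-subset is frequent (Apriori property)
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = {c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k))
                 and support(c) >= min_support}
        k += 1
    return frequent

transactions = [frozenset(t) for t in
                [{"milk", "bread"}, {"milk", "bread", "beer"},
                 {"bread", "beer"}, {"milk", "beer"}]]
result = apriori(transactions)
print(result[frozenset({"milk", "bread"})])  # 2
```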

      The course project was a lot of fun: it was about Link Prediction in social networks. We were given a social network graph, extracted some features from it and trained a classifier to predict if two nodes have a link (i.e. if two users of a social network know each other).
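
Our actual features and classifier were more involved, but the idea can be shown with the classic common-neighbors feature on a made-up graph:

```python
def common_neighbors(graph, u, v):
    """Number of shared neighbors - a classic link-prediction feature:
    the more neighbors two nodes share, the more likely a link."""
    return len(graph[u] & graph[v])

# Tiny undirected friendship graph: node -> set of neighbors
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
}

# ann and dan are not linked, but share two neighbors,
# so a classifier using this feature would score the pair highly
print(common_neighbors(graph, "ann", "dan"))  # 2
```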

      The exam was based on the pen-and-paper exercises we had (on decision trees, frequent patterns, clustering and Naive Bayes). It was semi-open-book (we were allowed to bring only 2 pages of handwritten notes), with no electronic devices allowed - not even a calculator.

      XML and Web Technologies

      Description: The advent of the World Wide Web has given rise to multiple technologies and techniques for exchanging data on the Web. This course studies these technologies for understanding the theory underlying these technologies but also to understand in what scenarios a certain technology is applicable.

      What we studied:
      • Introduction: semi-structured data model, trees, XML, namespaces
      • DTD and XML Schema (XSD)
      • Tree Automata for schema validation
      • XPath
      • XSLT
      • Semantic web: RDF, RDFS, RDFS Plus, OWL
      • XQuery
      • Integrity constraints and how to validate them
      The course was quite practical (except for two lectures on validation), and the computer exercises were done using oXygen and Protégé. We had no course project for this course.
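
The XPath basics can be tried without oXygen; Python's standard library supports a small XPath subset (the catalog document below is made up):

```python
import xml.etree.ElementTree as ET

# A tiny document and a few XPath queries; ElementTree implements
# a limited XPath subset, which is enough for the basics
doc = ET.fromstring("""
<catalog>
  <book year="2009"><title>Mining of Massive Datasets</title></book>
  <book year="2011"><title>Database Systems</title></book>
</catalog>
""")

# All titles, following the child axis
titles = [t.text for t in doc.findall("./book/title")]

# A predicate on an attribute value
recent = doc.findall("./book[@year='2011']/title")

print(titles)
print(recent[0].text)  # Database Systems
```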

      As for the exam, it was open-book, but we were allowed to use only slides. We had a couple of exercises on tree automata and then on schema, XPath, XSLT and RDF/RDFS. It wasn't hard except for one thing: we had to write everything by hand (completely pointless in my opinion).

      Information Retrieval

      Description: This course studies the processing, indexing, querying, organization and classification of textual documents. It also gives the foundations of natural language processing and its use in information retrieval.

      What we studied:
      • Introduction, indexing, inverted index
      • Index processing: tokenization, stop word removal, stemming
      • Vector space models: bag-of-words, term frequency, TF-IDF weighting, vector space similarity
      • Other methods: Boolean model, probabilistic ranking
      • Quality metrics: precision, recall, F-score
      • User-based ranking: personalization
      • Recommender systems: content-based, collaborative filtering
      • Markov ranking: PageRank, Hubs and Authorities
      • NLP: spelling correction, edit distance
      • NLP: morphology, part of speech tagging, named entities
      • NLP: syntax, parse trees, Chomsky's hierarchy
      We again had two types of exercises: pen-and-paper (calculating TF-IDF, recommendations, PageRank) and lab work (indexing with Lucene, some labs with Stanford NLP, plus a tool for parsing text data into parse trees).
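
The TF-IDF calculation we did by hand is easy to sketch in Python (this uses the plain log(N/df) variant; the exact weighting scheme in the course may have differed):

```python
import math

def tf_idf(docs):
    """Toy TF-IDF: tf = raw term count in the document,
    idf = log(N / df) where df counts documents containing the term."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
        weights.append(w)
    return weights

# Made-up three-document corpus, already tokenized
docs = [["cup", "world", "cup"], ["world", "news"], ["cup", "final"]]
w = tf_idf(docs)
print(round(w[0]["cup"], 3))  # 0.811 = 2 * ln(3/2)
```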

      The project was more of a software engineering project than an IR project. The task was to build a system for indexing the FIFA World Cup data warehouse, plus documents that users could upload. Apart from indexing the documents, they needed to be retrieved in a personalized way, and we had to be able to make recommendations. We used existing libraries (in our case Lucene, Stanford NLP, Mahout) and just had to put them together.

      On the exam we had to calculate TF-IDF for a small corpus, create parse trees and compute PageRank.
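
PageRank by power iteration, as we computed it by hand, looks roughly like this in Python (a toy three-page graph, ignoring dangling nodes):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.
    links: node -> list of nodes it links to (no dangling nodes)."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleport part plus each page's rank split among its out-links
        new = {node: (1 - damping) / n for node in nodes}
        for node, outgoing in links.items():
            share = rank[node] / len(outgoing)
            for target in outgoing:
                new[target] += damping * share
        rank = new
    return rank

# Three pages: A and B both link to C, C links back to A
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # C gets the highest rank
```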

      Business Intelligence Seminar

      Description: This course presents current trends and recent developments in the domain of BI. It is designed and jointly taught by all consortium partners (main and associated) and will involve guest speakers presenting their organization, the three specializations, research topics, internships, and Masters’ thesis subjects for the second year of the master.

      We could select from one of the following topics:
      • Query recommendation and optimization
      • Measuring the quality of database queries
      • Social Network Based Recommendation with Temporal Dynamics
      • Leveraging parsing, multi-word expressions and multilingualism for information retrieval
      • Trajectory Data mining
      • Semantic technologies in real life
      • Among others
      The whole course is a course project: in groups of 4-5 people we selected one of the topics, got a supervisor knowledgeable about it, did a lot of reading, and at the end presented everything we learned. There was no exam for this course.


      Depending on our choice of specialization, we studied French, German or Spanish. But it was also possible to take another language (say, German, even though you were going to Spain) or take two of them at the same time.

      Second Year

      That was our first year. For the second year we have a specialization at École Centrale Paris, Technische Universität Berlin, or Universitat Politècnica de Catalunya (Barcelona). The specialization of my choice was the one at TU Berlin, and I describe it in the post IT4BI: Distributed and Large-Scale Business Intelligence.
