29 April 2015

How to count distinct?

Counting the number of distinct elements in a data set is a very common query. It gives you an idea of how many duplicates you are dealing with. Say, for example, that you have a set of transactions and want to know whether they come from a small group of frequent buyers or from many different customers. The answer can help you understand your clients and decide which marketing strategies to adopt.
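To make the transaction example concrete, here is a minimal sketch of an exact distinct count in Python. The `transactions` data, the customer names, and the `count_distinct` helper are all made up for illustration; they are not taken from a real data set or from the full post.

```python
from collections import Counter

# Hypothetical transactions as (transaction_id, customer_id) pairs.
transactions = [
    (1, "alice"),
    (2, "bob"),
    (3, "alice"),
    (4, "carol"),
    (5, "alice"),
    (6, "bob"),
]


def count_distinct(values):
    """Return the exact number of distinct values by collecting them into a set."""
    return len(set(values))


customer_ids = [customer for _, customer in transactions]

# How many different customers made these transactions?
distinct_customers = count_distinct(customer_ids)

# How many transactions did each customer make?
purchases_per_customer = Counter(customer_ids)

print(distinct_customers)                      # 3
print(len(transactions) / distinct_customers)  # 2.0 transactions per customer
print(purchases_per_customer.most_common(1))   # [('alice', 3)]
```

A set gives an exact answer, but it keeps every distinct value in memory, so this approach only suits data that fits on one machine.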

23 April 2015

Apache Mahout Samsara: The Quick Start

Apache Mahout 0.10 was released last week. One of its new features is a math environment called “Samsara”, also known as the Mahout Scala/Spark Bindings.

Samsara is a linear algebra library for Mahout. It’s written in Scala, which supports operator overloading, so it offers a nice R-like or Matlab-like syntax for basic linear algebra operations. For example, matrix multiplication is just X %*% Y. What is more, these operations can be distributed and run by an execution engine, currently Apache Spark.

In this article we will see how to quickly set up a basic skeleton project and then we’ll try to do some very simple analysis on a 200 MB dataset.