Showing posts with label Apache Spark. Show all posts
Showing posts with label Apache Spark. Show all posts

27 September 2016

A Beginner's Guide to Apache Flink – 12 Key Terms, Explained


Overview
In this post, I will go through 12 core Apache Flink concepts to better understand what it does and how it works. This article could perfectly serve as a beginner's overview of Flink and Streaming engine terminology.


1.      What is Apache Flink?

At first glance, the origins of Apache Flink can be traced back to June 2008 as a researching project of the Database Systems and Information Management (DIMA) Group at the Technische Universität (TU) Berlin in Germany.

Apache Flink is an open source platform for distributed stream and batch data processing, initially it was designed as an alternative to MapReduce and the Hadoop Distributed File System (HFDS) in Hadoop origins.

According to the Apache Flink project, it is an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also builds batch processing on top of the streaming engine, overlaying native iteration support, managed memory, and program optimization.”

23 April 2015

Apache Mahout Samsara: The Quick Start

Apache Mahout Samsara: The Quick Start

Last week the newest Apache Mahout 0.10 was released. One of the new features it has is a new math environment called “Samsara”, or Mahout Scala/Spark Bindings.

Samsara is a Linear Algebra library for Mahout. It’s written in Scala, which makes it possible to use operator overloading and it features nice R-like or Matlab-like syntax for basic Linear Algebra operations. For example, matrix multiplication is just X %*% Y. What is more, these operations can be distributed and run by an executing environment - currently by Apache Spark.

In this article we will see how to quickly set up a basic skeleton project and then we’ll try to do some very simple analysis on a 200 MB dataset.