27 September 2016

A Beginner's Guide to Apache Flink – 12 Key Terms, Explained


Overview
In this post, I go through 12 core Apache Flink concepts to explain what Flink does and how it works. The article is intended as a beginner's overview of Flink and of streaming-engine terminology.


1.      What is Apache Flink?

The origins of Apache Flink can be traced back to June 2008, when it began as a research project of the Database Systems and Information Management (DIMA) Group at the Technische Universität (TU) Berlin in Germany.

Apache Flink is an open source platform for distributed stream and batch data processing. It was initially designed as an alternative to MapReduce and the Hadoop Distributed File System (HDFS), the original Hadoop components.

According to the Apache Flink project, it is “an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also builds batch processing on top of the streaming engine, overlaying native iteration support, managed memory, and program optimization.”

What does Flink offer?

Streaming:
·       High Performance & Low Latency
·       Support for Event Time and Out-of-Order Events
·       Exactly-once Semantics for Stateful Computations
·       Highly flexible Streaming Windows
·       Continuous Streaming Model with Backpressure
·       Fault-tolerance via Lightweight Distributed Snapshots

Batch and Streaming in One System:
·       One Runtime for Streaming and Batch Processing
·       Memory Management
·       Iterations and Delta Iterations
·       Program Optimizer

Ecosystem:
·       Broad Integration (YARN, Hadoop, HDFS, Kafka, others)

What are Flink’s components?

The Flink stack offers application programming interfaces (APIs) in Java, Scala and Python, a shell console, tools and libraries for developing data-intensive applications on top of the Flink engine.


2.      Deploy Layer / What does Flink’s Deploy Layer do?

Apache Flink can execute programs in a variety of contexts, for example standalone or embedded in other programs. Execution can happen in a local Java Virtual Machine (JVM) or on clusters with many nodes.


3.      Core Layer (Runtime) / What does Flink’s Core Layer do?

The Distributed Streaming Dataflow layer receives a program in the form of a “JobGraph”: a generic parallel data flow with arbitrary tasks whose inputs and outputs are data streams.
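
To make the idea of a parallel data flow more concrete, here is a minimal sketch in plain Python. It is not Flink's API: it only models a job as a directed acyclic graph of tasks and derives an order in which the tasks could be scheduled. The task names are hypothetical, and a real JobGraph carries much more information, such as parallelism and how data is shipped between tasks.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each task maps to the set of upstream tasks it reads from.
job_graph = {
    "source": set(),
    "parse": {"source"},
    "filter": {"parse"},
    "count": {"filter"},
    "sink": {"count"},
}

# A valid execution order places every task after the tasks that feed it.
order = list(TopologicalSorter(job_graph).static_order())
print(order)
```

Because this example is a simple chain, there is only one valid order, from the source to the sink.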

4.      Stream Processing / What is stream processing?

Stream processing is an approach in which the business logic is applied to every transaction as it is recorded in a system in real time. Examples include e-commerce, the Internet of Things (IoT) with sensors that emit data continuously, online monitoring of city traffic, telecom and banking. In other words, stream processing applies business logic to each event as it is captured, instead of storing all events and then processing them as a batch. The main advantage of processing data this way is that the analysis reflects the current state of the data, that is, it works in real time on unbounded data.
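
The per-event idea can be sketched in a few lines of plain Python. This is only an illustration and not Flink code: the event shape, the sensor ids and the function names are assumptions made for the example, and the short list stands in for a source that would never end in a real system.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, float]  # (sensor id, reading); a hypothetical event shape


def running_max(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit the highest reading seen so far per sensor, one result per event."""
    state: Dict[str, float] = {}
    for sensor, reading in events:
        # The logic runs as each event arrives; nothing waits for the full input.
        state[sensor] = max(reading, state.get(sensor, reading))
        yield sensor, state[sensor]


def sensor_feed() -> Iterator[Event]:
    """Stand-in for an unbounded source; a real stream would never end."""
    yield from [("s1", 20.5), ("s2", 18.0), ("s1", 22.1), ("s2", 17.4)]


results = list(running_max(sensor_feed()))
print(results)
```

Each incoming event immediately produces an updated result, so a consumer always sees the latest state without waiting for the input to finish.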

5.      Batch Processing

Batch processing means processing a large volume of bounded data at once. The steps needed for processing are called batch jobs. Batch jobs can be collected during working hours and executed in the evening, or collected over weeks or months and executed at the weekend or once a month. The classic example of batch processing is how credit card companies handle billing. The customer does not receive a bill for each transaction; usually a bill is issued each month, once all the data has been collected. One tool for this kind of workload is Hadoop, which provides MapReduce for processing large files that may hold months or years of stored data.


[1]
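
The billing example can be sketched in plain Python as follows. This is a toy illustration, not Hadoop or Flink code, and the customer names and amounts are made up. The point is that the whole bounded data set is available before any processing starts.

```python
from collections import defaultdict

# Hypothetical transactions collected over a month: (customer, amount).
transactions = [
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 45.0),
    ("bob", 7.5),
]


def monthly_bills(collected):
    """Process the complete, bounded data set in one pass: one bill per customer."""
    totals = defaultdict(float)
    for customer, amount in collected:
        totals[customer] += amount
    return dict(totals)


bills = monthly_bills(transactions)
print(bills)
```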

6.      Flink DataStream API (for Stream Processing)

DataStream is the main API that Apache Flink offers, and it is what sets Flink apart from its competitors. The DataStream API lets you develop programs (in Java, Scala and Python) that implement transformations on data streams (see examples in 6.1). Data streams are initially created from sources such as message queues, socket streams or files. Results are returned via data sinks, which can write the data to distributed files or, for example, to the command line terminal.

6.1  Examples of transformations in Flink:

·       Map
·       FlatMap
·       Filter
·       KeyBy
·       Reduce
·       Fold
·       Aggregations
·       Window
·       WindowAll
·       Window Apply
·       Window Reduce
·       Window Fold
·       Aggregations on windows
·       Union
·       Window Join
·       Window CoGroup
·       Connect
·       CoMap, CoFlatMap
·       Split
·       Select
·       Iterate
·       Extract Timestamps
·       Project (for data streams of Tuples)
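
To give a feel for how a few of these transformations combine, here is a rough sketch in plain Python that imitates Map, Filter, KeyBy and a windowed count. It does not use the Flink API. The events, the 10-second window size and the variable names are assumptions made for the example, and a real streaming job would emit each window's result when the window closes rather than after reading a finite list.

```python
from collections import defaultdict

# Hypothetical events: (timestamp in seconds, word).
events = [(1, "flink"), (2, "Spark"), (4, "flink"), (11, "FLINK"), (12, "storm")]

WINDOW_SIZE = 10  # seconds per tumbling window (an illustrative choice)

# Map: normalise each word to lower case.
mapped = ((ts, word.lower()) for ts, word in events)

# Filter: keep only the words of interest.
filtered = ((ts, word) for ts, word in mapped if word in {"flink", "spark"})

# KeyBy + Window + Reduce: count occurrences per (word, window start).
counts = defaultdict(int)
for ts, word in filtered:
    window_start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    counts[(word, window_start)] += 1

print(dict(counts))
```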

7.      Flink DataSet API (for Batch Processing)

Apache Flink provides a DataSet API that lets developers write programs (in Java, Scala and Python) that implement transformations on data sets (see examples in 7.1). Data sets are initially created from sources that can be file-based, collection-based, socket-based or custom. Results are returned via data sinks, which can write the data to distributed files or, for example, to the command line terminal.

7.1  Examples of transformations in Flink:

·       Map
·       FlatMap
·       Map Partition
·       Filter
·       Reduce
·       ReduceGroup
·       Aggregate
·       Distinct
·       Join
·       OuterJoin
·       CoGroup
·       Cross
·       Union
·       Rebalance
·       Hash-Partition
·       Range-Partition
·       Custom Partitioning
·       Sort Partition
·       First-n
·       Project (for data sets of Tuples)
·       MinBy / MaxBy (for data sets of Tuples)
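
As a rough illustration of what some of these transformations do, the following plain-Python sketch imitates Distinct, Join and a grouped Reduce on small in-memory collections. It is not the DataSet API, and the sample records and field layout are invented for the example.

```python
# Hypothetical bounded inputs.
orders = [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 45.0), (3, "alice", 45.0)]
countries = [("alice", "DE"), ("bob", "ES")]

# Distinct: drop duplicate records.
distinct_orders = sorted(set(orders))

# Join: match each order with the customer's country on the customer name.
country_of = dict(countries)
joined = [
    (country_of[name], amount)
    for _, name, amount in distinct_orders
    if name in country_of
]

# Group + Reduce: total order amount per country.
totals = {}
for country, amount in joined:
    totals[country] = totals.get(country, 0.0) + amount

print(totals)
```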


8.      FlinkCEP – Complex event processing for Flink

Apache Flink includes a complex event processing (CEP) library that lets developers detect complex event patterns in an endless stream of data. By analysing the matching sequences, data scientists can construct complex events and analyse the data in depth.
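
The kind of pattern a CEP library looks for can be sketched by hand in plain Python. The example below is not FlinkCEP code. It assumes a hypothetical monitoring scenario in which an alert is raised when the same rack reports two hot readings in a row within 10 seconds. The threshold, the time limit and the event layout are all made up for the illustration.

```python
THRESHOLD = 100.0  # hypothetical temperature limit
WITHIN = 10        # maximum seconds between the two matching events

# Hypothetical events: (timestamp in seconds, rack id, temperature).
events = [
    (1, "rack1", 101.0),
    (3, "rack2", 80.0),
    (5, "rack1", 105.0),
    (20, "rack2", 102.0),
    (40, "rack2", 103.0),
]


def detect_overheating(stream):
    """Yield (rack, first_ts, second_ts) for two hot readings in a row within WITHIN seconds."""
    last_hot = {}  # rack id -> timestamp of the previous hot reading
    for ts, rack, temp in stream:
        if temp <= THRESHOLD:
            last_hot.pop(rack, None)  # a normal reading breaks the sequence
            continue
        previous = last_hot.get(rack)
        if previous is not None and ts - previous <= WITHIN:
            yield (rack, previous, ts)
        last_hot[rack] = ts


alerts = list(detect_overheating(events))
print(alerts)
```

Only rack1 triggers an alert here: rack2 also reports two hot readings, but they are 20 seconds apart and so fall outside the time limit.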



9.      Streaming SQL

Streaming SQL is a query language that extends traditional SQL so it can process real-time data streams. The main challenge is to incorporate aggregations, windows and time semantics on streams. The Apache Flink community is currently working with the Apache Calcite community to develop a new model that addresses these challenges.


10.      Flink Table API & SQL

The Flink Table API and SQL are experimental features focused on Streaming SQL. They make it possible to use SQL-like expressions for relational stream and batch processing. The Table API and SQL interface operate on a relational Table abstraction and integrate tightly with the Flink DataSet API.

11.      Flink ML (Machine Learning)

Apache Flink provides an extensive, scalable and relatively new Machine Learning (ML) library for Flink developers. Flink ML has two main goals: to help developers keep glue code to a minimum, and to be easy to use by providing detailed documentation with examples. Flink ML lets data scientists test their algorithms and models locally on a subset of the data and then run the same code at a much larger scale on a cluster.

The Machine learning algorithms supported at the moment are:

Supervised Learning:
·       SVM using communication-efficient distributed dual coordinate ascent (CoCoA)
·       Multiple linear regression
·       Optimization Framework

Data Preprocessing:
·       Polynomial Features
·       Standard Scaler
·       MinMax Scaler

Recommendation:
·       Alternating Least Squares (ALS)

Utilities:
·       Distance Metrics
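
To show what the two scalers listed under data preprocessing compute, here is a small plain-Python sketch. It is not Flink ML code and the function names are hypothetical. It also assumes the population standard deviation for the standard scaler, which may differ from what a given library uses.

```python
from statistics import mean, pstdev


def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly so the smallest maps to lo and the largest to hi."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        return [lo for _ in values]  # constant feature: nothing to spread out
    return [lo + (v - v_min) * (hi - lo) / span for v in values]


def standard_scale(values):
    """Shift to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


feature = [2.0, 4.0, 6.0, 8.0]
min_max_scaled = min_max_scale(feature)
standard_scaled = standard_scale(feature)
print(min_max_scaled)
print(standard_scaled)
```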



12.      Flink Gelly (Graph Processing)

Gelly is a Graph API for Flink that contains a variety of methods and tools (such as graph transformations and utilities, iterative graph processing and a library of graph algorithms) for building graph analysis applications in Flink. Gelly can be mixed seamlessly with the Flink DataSet API to develop programs that use both record-based and graph-based analysis.

Flink Gelly provides the following graph methods:
·       Graph Properties (e.g. getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, etc.)
·       Transformations (e.g. map, filter, join, subgraph, union, difference, reverse, undirected, getTriplets, etc.)
·       Mutations (e.g. add vertex, add edge, remove vertex, remove edge)
·       Neighborhood Methods.


Some algorithms provided by Flink Gelly:

·       PageRank
·       Single Source Shortest Paths
·       Label Propagation
·       Weakly Connected Components
·       Community Detection
·       Triangle Count & Enumeration
·       Graph Summarization
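
To illustrate the iterative style of graph processing behind algorithms like these, here is a plain-Python sketch of weakly connected components. It is not Gelly code. It repeatedly propagates the smallest vertex id along every edge, ignoring direction, until nothing changes. The small graph and the function name are assumptions made for the example.

```python
# Hypothetical graph: vertex ids and directed edges.
vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (4, 5)]


def weakly_connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component (direction ignored)."""
    label = {v: v for v in vertices}
    changed = True
    while changed:  # iterate until no label changes (a fixed point)
        changed = False
        for src, dst in edges:
            smallest = min(label[src], label[dst])
            for v in (src, dst):
                if label[v] != smallest:
                    label[v] = smallest
                    changed = True
    return label


components = weakly_connected_components(vertices, edges)
print(components)
```

Vertices that end up with the same label belong to the same component, so this graph has two components.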


13.      BONUS term
Blink (Alibaba Flink)

Blink is a project from Alibaba Group that bundles improvements to Flink. Alibaba operates the world’s largest online marketplace, with profits bigger than those of Amazon and eBay combined. The improvements include changes to the Flink Table API (such as unifying the SQL layer for batch and streaming and adding stream-stream joins, aggregations, windowing and retraction) and a runtime that stays compatible with the Flink API and ecosystem (such as a new runtime architecture on YARN, optimized state, checkpointing and failover, and production-quality reliability).


Summary

Apache Flink could be considered the 4th generation of Big Data frameworks: instead of waiting through a long batch processing cycle until data becomes available, data scientists can work in real time with data that is generated and processed continuously.

References:

[1] Batch Processing, http://www.itrelease.com/2012/12/what-are-advantages-and-disadvantages-of-batch-processing-systems/




