
(Big-)Data Architecture (Re-)Invented

William El Kaim
Dec. 2016 – V3.3
This Presentation is part of the
Enterprise Architecture Digital Codex

Copyright © William El Kaim 2016 http://www.eacodex.com/ 2


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 3
Taming the Data Deluge

Copyright © William El Kaim 2016 Source: Domo 4


Copyright © William El Kaim 2016 5
Taming the Data Deluge

Copyright © William El Kaim 2016 6


Taming the Data Deluge

Copyright © William El Kaim 2016 7


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 8
What is Big Data?
• A collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data
processing applications
• Because these challenges are technical in nature, they arise in analytics at much lower
volumes than what is traditionally considered Big Data.
• Big Data Analytics is:
• The same as ‘Small Data’ Analytics, only with the added challenges (and potential) of
large datasets (~50M records or 50GB size, or more)
• Challenges :
• Data storage and management
• De-centralized/multi-server architectures
• Performance bottlenecks, poor responsiveness
• Increasing hardware requirements

Copyright © William El Kaim 2016 Source: SiSense 9


Copyright © William El Kaim 2016 10
What is Big Data: the “Vs” to Nirvana

Visualization

Big Data: A collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications

Big Data: When the data can't fit in Excel. The limit used to be 65,536 rows; it is now 1,048,576.

Big Data: When it's cheaper to keep everything than spend the effort to decide what to throw
away (David Brower @dbrower)

Copyright © William El Kaim 2016 Source: James Higginbotham 11


Six V to Nirvana

Copyright © William El Kaim 2016 Source: Bernard Marr 12


Six V to Nirvana

Copyright © William El Kaim 2016 Source: IBM 13


Six V to Nirvana

Copyright © William El Kaim 2016 Source: IBM 14


Six V to Nirvana

Copyright © William El Kaim 2016 Source: IBM 15


Six V to Nirvana

Copyright © William El Kaim 2016 Source: IBM 16


Six V to Nirvana
Visualization

Copyright © William El Kaim 2016 Source: Bernard Marr 17


Big Data Challenges
• Data access, cleansing and categorization
• Data storage and management

• De-centralized/multi-server architectures management


• Performance bottlenecks, poor responsiveness, crashes
• Hardware requirements
• (Big) data management tooling: cluster management, multi-region deployment, etc.
• Unstable software distributions in a rapidly evolving field
• Missing Skills

Copyright © William El Kaim 2016 18


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 19
Big Data: Why Now ?

Copyright © William El Kaim 2016 20


Datafication of our World

Source: Bernard Marr

Copyright © William El Kaim 2016 21


New Databases

Source: Robin Purohit


Copyright © William El Kaim 2016 22
New Databases

Copyright © William El Kaim 2016 23


Data Science (R)evolution

Source: Capgemini

Copyright © William El Kaim 2016 24


Open APIs -> Platforms -> Business Ecosystems

Copyright © William El Kaim 2016 25


All in one Ready To Use Solutions

Copyright © William El Kaim 2016 http://web.panoply.io/ 26


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 27
Hadoop Genesis
• Scalability issue when running jobs that process terabytes of data
• It could take dozens of days just to read that amount of data on one computer
• Need lots of cheap computers
• Fixes the speed problem
• But leads to reliability problems
• In large clusters, computers fail every day
• Cluster size is not fixed
• Need a common infrastructure
• Must be efficient and reliable

Copyright © William El Kaim 2016 28
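To make the MapReduce idea concrete, here is a minimal, hypothetical word-count job written as Hadoop Streaming scripts in Python. The file names and input/output paths are illustrative only; Hadoop Streaming simply pipes HDFS data through any executable mapper and reducer.

```python
#!/usr/bin/env python
# Word-count as two Hadoop Streaming scripts. In practice mapper.py and
# reducer.py are separate files, submitted with something like:
#   hadoop jar hadoop-streaming.jar -input /logs -output /counts \
#       -mapper mapper.py -reducer reducer.py
import sys


def mapper(stream=sys.stdin):
    """mapper.py: emit '<word>\t1' for every word read from the input split."""
    for line in stream:
        for word in line.strip().split():
            print("%s\t1" % word.lower())


def reducer(stream=sys.stdin):
    """reducer.py: sum the counts per word (input arrives sorted by key)."""
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))


if __name__ == "__main__":
    # Run as "python wordcount.py reduce" to act as the reducer.
    reducer() if "reduce" in sys.argv else mapper()
```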


Hadoop Genesis

• 2004: Google publishes the “Map-reduce” paper: “It is an important technique!”
• Doug Cutting extended Apache Nutch with this technique

Copyright © William El Kaim 2016 Source: Xiaoxiao Shi 29


Hadoop Genesis
“Hadoop came from the name my kid gave a stuffed yellow elephant.
Short, relatively easy to spell and pronounce, meaningless, and not
used elsewhere: those are my naming criteria.”
Doug Cutting, Hadoop project creator

• Open Source Apache Project


• Written in Java
• Running on Commodity hardware and all major OS
• Linux, Mac OS/X, Windows, and Solaris

Copyright © William El Kaim 2016 30


Enters Hadoop the Big Data Refinery
• Hadoop is not replacing anything.
• Hadoop has become another component in an organization's enterprise data platform.
• Hadoop (the Big Data Refinery) can ingest data from all types of different sources.
• Hadoop then exchanges data flows with the traditional systems that provide transactions and interactions (relational databases) and with business intelligence and analytic systems (data warehouses).

Copyright © William El Kaim 2016 Source: DBA Journey Blog 31


Hadoop Platform

Copyright © William El Kaim 2016 Source: Octo Technology 32


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 33
When to Use Hadoop?
Two Main Use Cases

1. Data Management and Storage
2. Big Data Analytics and Use

Copyright © William El Kaim 2016 34


When to Use Hadoop?
1. Data Mgt and Storage
2. Data Analytics and Use

Source: Octo Technology

Copyright © William El Kaim 2016 35


Hadoop for Data Mgt and Storage 1

ETL Pre-processor
• Shift the ETL pre-processing done in the data warehouse staging area to Hadoop
• Shifts high-cost data warehousing to lower-cost Hadoop clusters

Copyright © William El Kaim 2016 Source: Microsoft 36
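As a sketch of this pattern, the snippet below uses PySpark to do the staging work on the Hadoop cluster: read raw delimited files from HDFS, cleanse and aggregate them, and write a compact Parquet extract that the data warehouse load job can then pick up. All paths, column names, and the warehouse hand-off are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-preprocessing").getOrCreate()

# Raw, unvalidated files landed on HDFS (path and schema are hypothetical).
raw = spark.read.option("header", "true").csv("hdfs:///staging/sales/*.csv")

cleaned = (raw
           .dropDuplicates(["order_id"])                        # de-dupe
           .filter(F.col("amount").cast("double").isNotNull())  # drop bad rows
           .withColumn("amount", F.col("amount").cast("double")))

daily = (cleaned
         .groupBy("order_date", "store_id")
         .agg(F.sum("amount").alias("daily_revenue")))

# Compact, columnar output that a downstream job loads into the EDW.
daily.write.mode("overwrite").parquet("hdfs:///curated/sales_daily")
```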


Hadoop for Data Mgt and Storage 1

Massive Storage

• Offload large volumes of historical data into cold storage with Hadoop
• Keep data warehouse for hot data to allow BI and analytics
• When data from cold storage is needed, it can be moved back into the warehouse

Copyright © William El Kaim 2016 Source: Microsoft 37


Hadoop for Data Analytics and Use 2

Six V to Nirvana to Provide

• Hindsight (what happened?)


• Oversight (what is happening?)
• Insight (why is it happening?)
• Foresight (what will happen?)

Copyright © William El Kaim 2016 38


Hadoop for Data Analytics and Use 2

From Hindsight to Insight to Foresight


• Traditional analytics tools are not well suited to capturing the full value of big
data.
• The volume of data is too large for comprehensive analysis.
• The range of potential correlations and relationships between disparate data sources is
too great for any analyst to test all hypotheses and derive all the value buried in the data.
• Basic analytical methods used in business intelligence and enterprise reporting tools
reduce to reporting sums, counts, simple averages and running SQL queries.
• Online analytical processing is merely a systematized extension of these basic analytics
that still relies on a human to direct activities and specify what should be calculated.

Copyright © William El Kaim 2016 39


Hadoop for Data Analytics and Use 2

From Hindsight to Insight to Foresight

Copyright © William El Kaim 2016 40


Hadoop for Data Analytics and Use 2

From Data Management To Data Driven Decisions


• Machine Learning: Data Reliability For Big Data
• In the big data world, bringing data together from multiple internal and external sources can be a challenge. A new approach moves from manual or rules-based matching to matching done via machine learning.
• The initial phase of machine learning is to provide transparency into the actual rules that drive the merge; it is then up to the user to evaluate the discovered rules and persist them in the system.
• Graph: Finding Relationships In The Data
• Graphs are used to help understand and navigate many-to-many relationships across all data entities.
• Cognitive Systems: Intelligent Recommendations
• Building intelligent systems that guide users and provide intelligent recommendations,
based on data and user behavior.

Copyright © William El Kaim 2016 Source: Reltio 41


Hadoop for Data Analytics and Use 2

From Data Management To Data Driven Decisions


• Collaborative Curation: Clean & Current Data
• Sharing data across all systems and functional groups helps realize the full value of data
collected. Marketing, sales, services, and support should all leverage the same reliable,
consolidated data. They should be able to collaborate and contribute to enriching the
data. They should be able to vote on data quality or the business impact of any data
entity. New data-driven applications must support this.
• Data Monetization & DaaS: New Revenue Streams
• The charter of a CDO is not only about data governance, data integration, and data
management. Increasingly, companies are asking CDOs to turn this data into new
revenue streams. With cloud-based data-as-a-service, companies can monetize their
data and become data brokers. Businesses can now collaborate with each other to
create common data resources and easily share or exchange data.

Copyright © William El Kaim 2016 Source: Reltio 42


Hadoop for Data Analytics and Use 2

Data Discovery

• Keep data warehouse for operational BI and analytics
• Allow data scientists to gain new discoveries on raw data (no format or structure)
• Operationalize discoveries back into the warehouse

Copyright © William El Kaim 2016 Source: Microsoft 43


Hadoop for Data Analytics and Use 2

Skills Needed

Copyright © William El Kaim 2016 44


When Not to Use Hadoop
• Real Time Analytics (Solved in Hadoop V2)
• Since Hadoop V1 could not be used for real-time analytics, people looked for a way to keep the strength of Hadoop (HDFS) while making the processing real time. The industry-accepted approach is to store the big data in HDFS and run Spark on top of it; Spark makes the processing (near) real time and very fast. Apache Kudu is also a complementary solution to consider.
• To Replace Existing Infrastructure
• All the historical big data can be stored in Hadoop HDFS, where it can be processed and transformed into structured, manageable data. After processing the data in Hadoop, you often still need to send the output to other database technologies for BI, decision support, reporting, etc.
• Small Datasets
• The Hadoop framework is not recommended for small, structured datasets: other tools on the market can do this work more easily and faster than Hadoop. For small-data analytics, Hadoop can also be costlier than other tools.

Copyright © William El Kaim 2016 Source: Edureka 45


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 46
How to Implement Hadoop?
Typical Project Development Steps
Data sources: unstructured data, business transactions & interactions (CRM, ERP, web, mobile, point of sale), log files, exhaust data, social media, sensors and devices, DB data.
The Big Data Platform sits alongside classic data integration & ETL and feeds business intelligence & analytics (dashboards, reports, visualization, …).

1. Capture big data: collect data from all sources, structured & unstructured
2. Process: transform, refine, aggregate, analyze, report
3. Distribute results: interoperate and share data with applications/analytics
4. Feedback: use operational data within the big data platform

Copyright © William El Kaim 2016 Source: HortonWorks 47


How to Implement Hadoop?
Typical Three Steps Process
Collect Explore Enrich

Copyright © William El Kaim 2016 Source: HortonWorks 48


How to Implement Hadoop?
Typical Project Development Steps

Flow: Raw Data (internal and external) -> Data Lake -> Lakeshore -> Data Science -> business- or usage-oriented extractions, visualization, analytics, machine learning, correlations, business intelligence.

1. Data Source Layer
2. Data Storage Layer
3. Data Processing / Analysis Layer
4. Data Output & Business Layer

Copyright © William El Kaim 2016 49


How to Implement Hadoop?
Dataiku DSS

Free Community Edition (Mac, Linux, Docker, AWS)

Copyright © William El Kaim 2016 http://www.dataiku.com/dss/ 50


How to Implement Hadoop?

Copyright © William El Kaim 2016 Source: Cirrus Shakeri 51


How to Implement Hadoop?
CRISP Methodology

Copyright © William El Kaim 2016 Source: The Modeling Agency and sv-europe 52
How to Implement Hadoop?
McKinsey Seven Steps Approach

Copyright © William El Kaim 2016 Source: McKinsey 53


How to Implement Hadoop?
IBM DataFirst

Copyright © William El Kaim 2016 Source: IBM 54


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 55
EDW Current Limitations
• Organizations are realizing that traditional EDW technologies can’t meet their
new business needs
• Including leveraging streaming and social data from the Web or from connected devices
on the Internet of things (IoT)
• One major challenge with traditional EDWs is their schema-on-write
architecture
• Enterprises must design the data model and articulate the analytic frameworks before
loading any data.
• In other words, they need to know ahead of time how they plan to use that data.

Copyright © William El Kaim 2016 Source: Zaloni 56


Hadoop to the Rescue
• Extract and place data into a Hadoop-based repository without first transforming the data the way you would for a traditional EDW.
• All frameworks are created in an ad hoc manner, with little or no prep work required.
• Labor-intensive processes of cleaning up data and developing schemas are deferred until a clear business need is identified.
• Hadoop can be 10 to 100 times less expensive to deploy than a traditional data warehouse.

Copyright © William El Kaim 2016 Source: Zaloni 57


Enters the Data Lake
• A data lake is:
• Typically built using Hadoop.
• Supports structured and unstructured data.
• Benefits from a variety of storage and processing tools to extract value quickly.
• Requires little or no processing to adapt the structure to an enterprise schema.
• A central location in which to store all your data in its native form, regardless of its source or format.

Copyright © William El Kaim 2016 Source: Zaloni 58


Who is Using Data Lakes?

Copyright © William El Kaim 2016 Source: PWC 59


Data Flow In The Data Lake

Copyright © William El Kaim 2016 Source: PWC 60


Schema On Read or Schema On Write?
• The Hadoop data lake concept can be summed up as, “Store it all in one
place, figure out what to do with it later.”
• But while this might be the general idea of your Hadoop data lake, you won’t
get any real value out of that data until you figure out a logical structure for it.
• And you’d better keep track of your metadata one way or another. It does no
good to have a lake full of data, if you have no idea what lies under the shiny
surface.
• At some point, you have to give that data a schema, especially if you want to
query it with SQL or something like it. The eternal Hadoop question is
whether to apply the brave new strategy of schema on read, or to stick with
the tried and true method of schema on write.

Copyright © William El Kaim 2016 Source: AdaptiveSystems 61


Schema On Write …
• Before any data is written in the database, the structure of that data is strictly
defined, and that metadata stored and tracked.
• Irrelevant data is discarded; data types, lengths and positions are all delineated.
• The schema (the columns, rows, tables and relationships) is defined first, for the
specific purpose that database will serve.
• Then the data is filled into its pre-defined positions. The data must all be cleansed,
transformed and made to fit in that structure before it can be stored in a process
generally referred to as ETL (Extract Transform Load).
• That is why it is called “schema on write” because the data structure is
already defined when the data is written and stored.
• For a very long time, it was believed that this was the only right way to
manage data.

Copyright © William El Kaim 2016 Source: AdaptiveSystems 62


Schema On Write …
• Benefits: Quality and Query Speed.
• Because the data structure is defined ahead of time, when you query, you know exactly
where your data is.
• The structure is generally optimized for the fastest possible return of data for the types of
questions the data store was designed to answer (write very simple SQL and get back
very fast answers).
• The answers received from querying data are sharply defined, precise and trustworthy,
with little margin for error.
• Drawbacks: Data Alteration & Query Limitations
• Data has been altered and structured specifically to serve a specific purpose. Chances
are high that, if another purpose is found for that data, the data store will not suit it well.
• ETL processes and validation rules are then needed to clean, de-dupe, check and
transform that data to match pre-defined format. Those processes take time to build,
time to execute, and time to alter if you need to change it to suit a different purpose.

Copyright © William El Kaim 2016 Source: AdaptiveSystems 63


Schema On Read …
• Revolutionary concept: “You don’t have to know what you’re going to do with
your data before you store it.”
• Data of many types, sizes, shapes and structures can all be thrown into the Hadoop
Distributed File System, and other Hadoop data storage systems.
• While some metadata needs to be stored so you know what’s in there, there is no need yet to know
how it will be structured!
• Therefore, the data is stored in its original granular form, with nothing thrown
away
• In fact, no structural information is defined at all when the data is stored.
• So “schema on read” implies that the schema is defined at the time the data
is read and used, not at the time that it is written and stored.
• When someone is ready to use that data, then, at that time, they define what pieces are
essential to their purpose, where to find those pieces of information that matter for that
purpose, and which pieces of the data set to ignore.

Copyright © William El Kaim 2016 Source: AdaptiveSystems 64


Schema On Read …
• Benefits: Flexibility and Query Power
• Because data is stored in its original form, nothing is discarded, or altered for a specific
purpose.
• Different types of data generated by different sources can be stored in the same place.
This allows you to query multiple data stores and types at once.
• This is what makes the heart of the Hadoop data lake concept, putting all available data sets in their
original form in a single location, such a potent one.
• Drawbacks: Inaccuracies and Slow Query Speed
• Since the data is not subjected to rigorous ETL and data cleansing processes, nor does
it pass through any validation, data may be riddled with missing or invalid data,
duplicates and a bunch of other problems that may lead to inaccurate or incomplete
query results.
• In addition, since the structure must be defined when the data is queried, the SQL
queries tend to be very complex. They take time to write, and even more time to
execute.

Copyright © William El Kaim 2016 Source: AdaptiveSystems 65


Some Examples in the Hadoop Ecosystem
• Drill is probably the best example of a pure schema on read SQL engine in
the Hadoop ecosystem today.
• It gives you the power to query a broad set of data, from a wide variety of different data
stores, including hierarchical data such as JSON and XML.
• Hive is the original schema on read technology, but is, in fact, a marvelous
hybrid of the two technologies. In order for Hive to gain the advantages of a
schema on write data store, ORC file format was created. This is a pre-
structured format optimized for Hive queries. By combining strategies, Hive
has gained many of the advantages of both camps.
• Spark SQL is entirely a schema on write technology, leveraging a data
processing engine that can do high speed ETL processes entirely in
memory.
• Other SQL on write options: Impala with Parquet, Actian Vortex with Vector in
Hadoop, IBM Big SQL with BigInsights, HAWQ with Greenplum, etc.

Copyright © William El Kaim 2016 Source: AdaptiveSystems 66
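As an illustration of the two strategies, here is a small, hypothetical PySpark sketch: the schema-on-read side queries raw JSON directly with an inferred schema, while the schema-on-write side materializes a cleaned, typed ORC table for fast repeated queries. Paths and column names are made up.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-strategies").getOrCreate()

# Schema on read: point the engine at raw JSON in the lake and let it
# infer the structure at query time. Flexible, but slower and less strict.
events = spark.read.json("hdfs:///lake/raw/clickstream/*.json")
events.createOrReplaceTempView("raw_events")
spark.sql("""
    SELECT userId, COUNT(*) AS clicks
    FROM raw_events
    WHERE eventType = 'click'
    GROUP BY userId
""").show()

# Schema on write: project and type the data up front and store it in a
# columnar format (ORC here), so later queries are fast and predictable.
curated = (events
           .select(F.col("userId").cast("string"),
                   F.col("eventType").cast("string"),
                   F.col("ts").cast("timestamp")))
curated.write.mode("overwrite").orc("hdfs:///lake/curated/clickstream_orc")
```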


Schema On Read or Schema On Write?
• Schema on read options tend to be a better choice:
• for exploration, for “unknown unknowns,” when you don’t know what kind of questions
you might want to ask, or the kinds of questions might change over time.
• when you don’t have a strong need for immediate responses. They’re ideal for data
exploration projects, and looking for new insights with no specific goal in mind.
• Schema on write options tend to be very efficient for “known unknowns.”
• When you know what questions you’re going to need to ask, especially if you will need
the answers fast, schema on write is the only sensible way to go.
• This strategy works best for old school BI types of scenarios on new school big data
sets.

Copyright © William El Kaim 2016 Source: AdaptiveSystems 67


Data Lake vs. Enterprise Data Warehouse

Copyright © William El Kaim 2016 Source: Martin Fowler 68


Data Lake vs. Enterprise Data Warehouse

Copyright © William El Kaim 2016 Source: Zaloni 69


Copyright © William El Kaim 2016 Source: PWC 70
The Lakeshore Concept
• The data lake shouldn't be
accessed directly very much.
• Because the data is raw, you need
a lot of skill to make any sense of
it.
• Lakeshore
• Create a number of data marts
each of which has a specific
model for a single bounded
context.
• A larger number of downstream
users can then treat these
lakeshore marts as an
authoritative source for that
context.

Copyright © William El Kaim 2016 Source: Martin Fowler 71


Data Lake vs. EDW

Copyright © William El Kaim 2016 Source: Platfora 72


Data Lake vs. EDW

Copyright © William El Kaim 2016 Source: HortonWorks 73


Data Lake Tools: Bigstep

Copyright © William El Kaim 2016 Source: Bigstep 74


Data Lake Tools
Informatica Data Lake Management
• Informatica Data Lake Management now includes the following innovations
brought together with a single metadata driven platform:
• Informatica Enterprise Information Catalog enables business users to quickly discover and
understand all enterprise data using intelligent self-service tools powered by machine learning
and AI.
• Informatica Intelligent Streaming helps organizations more efficiently capture and process big
data (machine, social media feeds, website click stream, etc.) and real-time events to gain
timely insight for business initiatives, such as IoT, marketing and fraud detection.
• Informatica Blaze dramatically increases Hadoop processing performance with intelligent data
pipelining, job partitioning, job recovery and scaling powered by a unique cluster aware data
processing engine integrated with YARN.
• Informatica Secure@Source® reduces the risk of proliferation of sensitive and private data with
enhanced monitoring and alerting of sensitive data based on both risk conditions and user
access/activity.
• Informatica Big Data Management™ is easily and rapidly deployed in the cloud with the click of
a button through the Microsoft Azure Marketplace to integrate, govern, and secure big data at
scale in Hadoop.
• Informatica Cloud Microsoft Azure Data Lake Store Connector helps customers achieve faster
business insights by providing self-service connectivity to integrate and synchronize diverse
data sets into Microsoft Azure Data Lake Store.

Copyright © William El Kaim 2016 Source: Informatica 75


Data Lake Tools: Microsoft Azure

Copyright © William El Kaim 2016 Source: Microsoft 76


Data Lake Tools: Koverse

Copyright © William El Kaim 2016 Source: Koverse 77


Data Lake Tools: Platfora

Copyright © William El Kaim 2016 Source: Platfora 78


Big Data: Discovery
• Waterline Data Inventory: Automated data discovery platform. Foundation of
self service data preparation, analytics, and data governance in Hadoop.

Copyright © William El Kaim 2016 79


Big Data: Discovery

Copyright © William El Kaim 2016 https://cloud.oracle.com/bigdatadiscovery 80


Data Lake Tool: Zaloni

Copyright © William El Kaim 2016 Source: Zaloni 81


Data Lake Tool: Zdata

Copyright © William El Kaim 2016 Source: Zdata 82


Data Lake Resources
• Introduction
• PWC: Data lakes and the promise of unsiloed data
• Zaloni: Resources
• Tools
• Bigstep
• Informatica Data Lake Mgt & Intelligent Data Lake
• Microsoft Azure Data Lake and Azure Data Catalog
• Koverse
• Oracle BigData Discovery
• Platfora
• Podium Data
• Waterline Data Inventory
• Zaloni Bedrock
• Zdata Data Lake

Copyright © William El Kaim 2016 83


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 84
What is BI on Hadoop?
Three Options

Copyright © William El Kaim 2016 Source: Dremio 85


BI on Hadoop
Three Options: ETL to Data Warehouse
• Pros
• Relational databases and their BI integrations are very mature
• Use your favorite tools
• Tableau, Excel, R, …
• Cons
• Traditional ETL tools don’t work well with modern data
• Changing schemas, complex or semi-structured data, …
• Hand-coded scripts are a common substitute
• Data freshness
• How often do you replicate/synchronize?
• Data resolution
• Can’t store all the raw data in the RDBMS (due to scalability and/or cost)
• Need to sample, aggregate or time-constrain the data

Copyright © William El Kaim 2016 Source: Dremio 86


What is BI on Hadoop?
Three Options: Monolithic Tools

• A single piece of software on top of Big Data
• Performs both data visualization (BI) and execution
• Utilizes sampling or manual pre-aggregation to reduce the data volume that the user is interacting with

• Examples: Arcadia, AtScale, Datameer, Indexima, Jethro, Looker, Platfora, Tamr, ZoomData
Copyright © William El Kaim 2016 Source: Modified from Platfora 87
What is BI on Hadoop?
Three Options: Monolithic Tools
• Pros
• Only one tool to learn and operate
• Easier than building and maintaining an ETL-to-RDBMS pipeline
• Integrated data preparation in some solutions
• Cons
• Can’t analyze the raw data
• Rely on aggregation or sampling before primary analysis
• Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …)
• Can’t run arbitrary SQL queries

Copyright © William El Kaim 2016 Source: Dremio 88


What is BI on Hadoop?
Three Options: SQL-on-Hadoop
• The combination of a familiar interface (SQL) along with a modern computing
architecture (Hadoop) enables people to manipulate and query data in new
and powerful ways.
• There’s no shortage of SQL on Hadoop offerings, and each Hadoop
distributor seems to have its preferred flavor.
• Not all SQL-on-Hadoop tools are equal, so picking the right tool is a challenge.

Copyright © William El Kaim 2016 Source: Datanami & Cloudera & Dremio 89
What is BI on Hadoop?
Three Options: SQL-on-Hadoop
• SQL on Hadoop tools could be categorized as
• Interactive or Native SQL
• Batch & Data-Science SQL
• OLAP Cubes (In-memory) on Hadoop

Copyright © William El Kaim 2016 90


What is BI on Hadoop?
SQL-on-Hadoop: Native SQL
• When to use it?
• These engines excel at executing ad-hoc SQL queries and self-service data exploration (often performed directly by data analysts), and at executing the machine-generated SQL code from BI tools like Qlik and Tableau.
• Latency is usually measured in seconds to minutes.
• One of the key differentiator among the interactive SQL-on-Hadoop tools is how they
were built.
• Some of the tools, such as Impala and Drill, were developed from the beginning to run on
Hadoop clusters, while others are essentially ports of existing SQL engines that previously ran
on vendors’ massively parallel processing (MPP) databases

Copyright © William El Kaim 2016 Source: Datanami 91


What is BI on Hadoop?
SQL-on-Hadoop: Native SQL
• Pros
• Highest performance for Big Data workloads
• Connect to Hadoop and also to NoSQL systems
• Make Hadoop “look like a database”
• Cons
• Queries may still be too slow for interactive analysis on many TB/PB
• Can’t defeat physics
• Interactive engines
• In 2012, Cloudera rolled out the first release of Apache Impala
• MapR has been pushing the schema-less bounds of SQL querying with Apache Drill, which is based on Google‘s Dremel
• Presto (created by Facebook, now backed by Teradata)
• VectorH (backed by Actian)
• Apache Hawq (backed by Pivotal)
• Apache Phoenix
• BigSQL (backed by IBM)
• Big Data SQL (backed by Oracle)
• Vertica SQL on Hadoop (backed by Hewlett-Packard)

Copyright © William El Kaim 2016 Source: Datanami & Dremio 92


What is BI on Hadoop?
SQL-on-Hadoop: Batch & Data Science SQL
• When to use it?
• Most often used for running big and complex jobs, including ETL and production data
“pipelines,” against massive data sets.
• Apache Hive is the best example of this tool category. The software essentially recreates a
relational-style database atop HDFS, and then uses MapReduce (or more recently, Apache
Tez) as an intermediate processing layer.
• Tools
• Apache Hive, Apache Tez, Apache Spark SQL
• Pros
• Potentially simpler deployment (no daemons)
• New YARN job (MapReduce/Spark) for each query
• Check-pointing support enables very long-running queries
• Days to weeks (ETL work)
• Works well in tandem with machine learning (Spark)
• Cons
• Latency prohibitive for interactive analytics
• Tableau, Qlik Sense, …
• Slower than native SQL engines

Copyright © William El Kaim 2016 Source: Datanami & Dremio 93


What is BI on Hadoop?
SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop
• When to use it?
• Data scientists doing self-service data exploration needing performance (in milliseconds to
seconds).
• Apache Spark SQL pretty much owns this category, although Apache Flink could provide it with some competition.
• These tools often require an in-memory computing architecture.
• Tools
• Apache Kylin, Apache Lens, AtScale, Druid, Kyvos Insights
• In-memory: Spark SQL, Apache Flink, Kognitio On Hadoop
• Other Options To Investigate:
• SnappyData (Strong SQL, In-Memory Speed, and GemfireXD history)
• Apache HAWQ (Strong SQL support and Greenplum history)
• Splice Machine (Now Open Source)
• Hive LLAP is moving into OLAP, SQL 2011 support is growing and so is performance.
• Apache Phoenix may be able to do basic OLAP with some help from Saiku OLAP BI Tool.
• Most tools use Apache Calcite

Copyright © William El Kaim 2016 Source: Dremio 94


What is BI on Hadoop?
SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop
• Pros
• Fast queries on pre-aggregated data
• Can use SQL and MDX tools
• Cons
• Explicit cube definition/modeling phase
• Not “self-service”
• Frequent updates required due to dependency on business logic
• Aggregate creation and maintenance can be long (and large)
• User connects to and interacts with the cube
• Can’t interact with the raw data

Copyright © William El Kaim 2016 95


What is BI on Hadoop?
SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop
• Apache Kylin lets you query massive data sets at sub-second latency in 3 steps.
1. Identify a Star Schema data on Hadoop.
2. Build Cube on Hadoop.
3. Query data with ANSI-SQL and get results via ODBC, JDBC or RESTful API.

Copyright © William El Kaim 2016 Source: Apache Kylin 96
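As a sketch of step 3, the query below could be submitted to Kylin over its REST API from Python. The host, project, cube/table names, and default ADMIN/KYLIN credentials are assumptions based on Kylin's documented REST interface; verify them against your own deployment.

```python
import requests

# Hypothetical star-schema aggregate that the pre-built cube can answer.
sql = """
SELECT part_dt, SUM(price) AS total_sold, COUNT(DISTINCT seller_id) AS sellers
FROM kylin_sales
GROUP BY part_dt
"""

resp = requests.post(
    "http://kylin-host:7070/kylin/api/query",       # assumed host and endpoint
    json={"sql": sql, "project": "learn_kylin", "limit": 100},
    auth=("ADMIN", "KYLIN"),                        # assumed demo credentials
)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row)
```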


What is BI on Hadoop?
SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop

Copyright © William El Kaim 2016 http://kognitio.com/on-hadoop/ 97


What is BI on Hadoop?
SQL-on-Hadoop: Synthesis
• Pros
• Continue using your favorite BI tools and SQL-based clients
• Tableau, Qlik, Power BI, Excel, R, SAS, …
• Technical analysts can write custom SQL queries
• Cons
• Another layer in your data stack
• May need to pre-aggregate the data depending on your scale
• Need a separate data preparation tool (or custom scripts)

Copyright © William El Kaim 2016 Source: Dremio 98


What is BI on Hadoop?
SQL-on-Hadoop: Synthesis

Copyright © William El Kaim 2016 Source: Dremio 99


What is BI on Hadoop?
SQL-on-Hadoop: Encoding Formats
• The different encoding standards result in different block sizes, and that can
impact performance.
• ORC files compress smaller than Parquet files, which can be a decisive factor.
• Impala, for example, accesses HDFS data that’s encoded in the Parquet format, while
Hive and others support optimized row column (ORC) files, sequence files, or plain text.
• Semi-structured data formats like JSON are gaining traction
• Previously Hadoop users were using MapReduce to pound unstructured data into a
more structured or relational format.
• Drill opened up SQL-based access directly to semi-structured data, such as JSON,
which is a common format found on NoSQL and SQL databases. Cloudera also recently
added support for JSON in Impala.

Copyright © William El Kaim 2016 Source: Datanami 100


What is BI on Hadoop?
SQL-on-Hadoop: Decision Tree

Copyright © William El Kaim 2016 Source: Dremio 101


What is BI on Hadoop?
Example: AtScale

Copyright © William El Kaim 2016 Source: AtScale 102


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 103
What is Big Data Analytics?
From Hindsight to Insight to Foresight

Copyright © William El Kaim 2016 104


What is Big Data Analytics?
Data Science
• Data Science is an interdisciplinary field about processes and systems to
extract knowledge or insights from data in various forms

Copyright © William El Kaim 2016 105


What is Big Data Analytics?
Data Science Maturity Model

Copyright © William El Kaim 2016 Source: Domino 106


What is Big Data Analytics?
Data Science Platform: Domino

Copyright © William El Kaim 2016 https://www.dominodatalab.com/ 107


What is Big Data Analytics?
Predictive Analytics

Source: Forrester

Source: wikipedia 108


Copyright © William El Kaim 2016
What is Big Data Analytics?
Finding the Best Approach

Copyright © William El Kaim 2016 Source: kdnuggets 109


What is Big Data Analytics?
What is Machine Learning?
• Machine learning is a scientific discipline that explores the construction and
study of algorithms that can learn from data.
• Such algorithms operate by building a model from example inputs and using that to
make predictions or decisions, rather than following strictly static program instructions.
• Machine learning is ideal for exploiting the opportunities hidden in big data.

Source: Rubén Casado Tejedor

Copyright © William El Kaim 2016 110


What is Big Data Analytics?
Machine Learning Use Cases
• Programming computers to perform an action using example data or past
experience
• learn from and make predictions on data
• It is used when:
• Human expertise does not exist (e.g. Navigating on Mars)
• Humans are unable to explain their expertise (e.g. Speech recognition)
• Solution changes in time (e.g. Routing on a computer network)
• Solution needs to be adapted to particular cases (e.g. User biometrics)

Source: Rubén Casado Tejedor

Copyright © William El Kaim 2016 111


What is Big Data Analytics?
Machine Learning Terminology

• Observations: items or entities used for learning or evaluation (e.g., emails)
• Features: attributes (typically numeric) used to represent an observation (e.g., length, date, presence of keywords)
• Labels: values / categories assigned to observations (e.g., spam, not-spam)
• Training and Test Data: observations used to train and evaluate a learning algorithm (e.g., a set of emails along with their labels). Training data is given to the algorithm for training, while test data is withheld at train time.

Source: Rubén Casado Tejedor

Copyright © William El Kaim 2016 112
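To tie the terminology together, here is a small scikit-learn sketch on synthetic data: each row of X is an observation described by numeric features, y holds the labels, and the data is split into training and test sets before a model is trained and evaluated. The dataset and model choice are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic observations: 1,000 rows, 10 numeric features, binary labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Training data is given to the algorithm; test data is withheld at train time.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # TRAIN
predictions = model.predict(X_test)                                # PREDICT
print("test accuracy:", accuracy_score(y_test, predictions))
```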


What is Big Data Analytics?
Machine Learning Types
• Supervised Learning: Learning from labelled observations
• Data is tagged
• Tagging may require the help of an expert to prepare the training set.
• Expertise is needed before machine learning.
• The challenge is about the generalization of the model
• Algorithms: Classification - Regression / Prediction - Recommendation
• Unsupervised Learning: Learning algorithm must find latent structure from
features alone.
• Output values are not known (i.e. the tags and their nature)
• Some of the attributes might not be homogeneous across all the samples
• The expertise is needed after machine learning, to interpret the results, and name the
discovered categories
• The challenge is about understanding the output classification
• Algorithms: generally group inputs by similarities (creating clusters)
• Clustering - Dimensionality Reduction - Anomaly detection

Copyright © William El Kaim 2016 113
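A minimal unsupervised counterpart, again on made-up data: no labels are provided, and the algorithm groups the observations into clusters that the analyst then has to interpret and name.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled observations: the "right" grouping is not given to the algorithm.
X, _ = make_blobs(n_samples=500, centers=3, n_features=2, random_state=0)

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
print(kmeans.labels_[:10])        # cluster assignments, still to be interpreted
print(kmeans.cluster_centers_)    # centroids of the discovered clusters
```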


What is Big Data Analytics?
Machine Learning Types

Source: Rubén Casado Tejedor

The two phases of machine learning:


• TRAIN a model
• PREDICT with a model
Source: Louis Dorard

Copyright © William El Kaim 2016 114


What is Big Data Analytics?
Machine Learning Example: Scikit-learn Algorithms Cheat-sheet

Copyright © William El Kaim 2016 Source: scikit-learn 115


What is Big Data Analytics?
Machine Learning Example: Wendelin

Copyright © William El Kaim 2016 http://www.wendelin.io/ 116


What is Big Data Analytics?
Machine Learning Example: Microsoft Azure ML

Cloud

Source: Microsoft
Copyright © William El Kaim 2016 117
Big Data: Azure Machine Learning

Copyright © William El Kaim 2016 Source: Microsoft 118


What is Big Data Analytics?
Approaches to Big Data Analytics: Sampling
• When to Use it?
• Only data exploration / data understanding
• Early prototyping on prepared and clean data
• Machine Learning modeling with very few and basic patterns (e.g. only a handful of columns and a binary prediction target)
• When NOT to use it?
• Large number of columns in the data
• Need to blend large data sets (e.g. large-scale joins)
• Complex Machine Learning models
• Looking for rare events
• Key Process
• Data Movement: pulls sample data from HDFS/Hive/Impala
• Data Processing: in the analytics tool
• Pros
• Simple and easy to start with
• Usually works well for data exploration and early prototyping
• Some ML models would not benefit from more data anyway
• Cons
• Many ML models would benefit from more data
• Cannot be used when large-scale data preparation is needed
• Hadoop is used as a data repository only
Copyright © William El Kaim 2016 Source: RapidMiner 119
What is Big Data Analytics?
Approaches to Big Data Analytics: Grid Computing
• When to Use it?
• Task can be performed on smaller, independent data subsets
• Compute-intensive data processing
• When NOT to use it?
• Data-intensive data processing
• Complex Machine Learning models
• Lots of interdependencies between data subsets
• Key Process
• Data Movement: only results are moved, data remains in Hadoop
• Data Processing: custom single-node application running on multiple Hadoop nodes
• Pros
• Hadoop is used for parallel processing in addition to being used as a data source
• Cons
• Only works if data subsets can be processed independently
• Only as good as the single-node engine, no benefit from fast-evolving Hadoop innovations

Copyright © William El Kaim 2016 Source: RapidMiner 120


What is Big Data Analytics?
Approaches to Big Data Analytics: Native Distributed Algorithms
• When to Use it?
• Complex Machine Learning models needed
• Lots of interdependencies inside the data (e.g. graph analytics)
• Need to blend and cleanse large data sets (e.g. large-scale joins)
• When NOT to use it?
• Data is not that large
• Sample would reveal all interesting patterns
• Key Process
• Data Movement: only results are moved, data remains in Hadoop
• Data Processing: executed by native Hadoop tools: Hive, Spark, H2O, Pig, MapReduce, etc.
• Pros
• Holistic view of all data and patterns
• Highly scalable distributed processing optimized for Hadoop
• Cons
• Limited set of algorithms available, very hard to develop new algorithms

Copyright © William El Kaim 2016 Source: RapidMiner 121


What is Big Data Analytics?
Approaches to Big Data Analytics: RapidMiner Example

Copyright © William El Kaim 2016 Source: RapidMiner 122


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 123
Storage vs. Processing
Processing (OLAP, OLTP, Streaming, Machine Learning): Hive, Pig, Cascading, Scalding, Impala, Hawq, HBase, MapReduce / Tez, Spark, Stinger, Mahout, Giraph, Hama, R, Python, SciKit

Hadoop Distributed Storage:
• Distributed FS: GlusterFS, HDFS, S3, MapR, Isilon, OpenStack Swift, Ceph, Ring
• Local FS
• NoSQL datastores: Cassandra, DynamoDB
Copyright © William El Kaim 2016 Source: Octo Technology 124


Big Data Technologies

Copyright © William El Kaim 2016 125


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Ingestion Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 126
Understanding Streaming Semantics

Copyright © William El Kaim 2016 127


Ingestion Technologies
Apache Flume
• Apache Flume is a distributed and reliable service for efficiently collecting,
aggregating, and moving large amounts of streaming data into HDFS
(especially “logs”).
• Data is pushed to the destination (Push Mode).
• Flume does not replicate events: in case of a Flume agent failure, you will lose the events in the channel.

Copyright © William El Kaim 2016 128


Ingestion Technologies
Apache Kafka
• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system, developed by LinkedIn, that persists messages to disk (Pull Mode).
• Designed for high Throughput, Kafka is often used in place of traditional
message brokers like JMS and AMQP because of its higher throughput,
reliability, and replication.
• Uses topics to which many listeners can subscribe, so that processing of messages
can happen in parallel on various channels
• High availability of events (recoverable in case of failures)

Copyright © William El Kaim 2016 129
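As a minimal sketch using the third-party kafka-python client (broker address, topic, and group name are assumptions), a producer publishes events to a topic and an independent consumer group pulls them:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a few events to a topic (broker and topic are hypothetical).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("clickstream", key=b"user-42", value=("event %d" % i).encode())
producer.flush()

# Consumer: pull messages as part of a consumer group, starting from the beginning.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
    break  # stop after the first message in this sketch
```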


Ingestion Technologies
Apache Storm
• Apache Storm, developed by BackType (bought by Twitter), is a reliable system for processing streaming data in real time (and generating new streams).
• Designed to support wiring “spouts” (think input streams) and “bolts”
(processing and output modules) together as a directed acyclic graph
(DAG) called a topology.
• One strength is the catalogue of available spouts specialized for receiving data from all types
of sources.
• Storm topologies run on clusters and the Storm scheduler distributes work to
nodes around the cluster, based on the topology configuration.

Copyright © William El Kaim 2016 130


Ingestion Technologies
Storm: Example

• Twitter streams, counting words, and storing them in NoSQL database

Copyright © William El Kaim 2016 Source: Trivadis 131


Ingestion Technologies
Storm: Example

Copyright © William El Kaim 2016 Source: Trivadis 132


Ingestion Technologies
Twitter Heron
• Twitter dropped Apache Storm in production in 2015 and replaced it with a
homegrown data processing system, named Heron.
• Apache Storm was the original solution to Twitter's problems.
• Storm was reputedly hard to work with and hard to get good results from, and despite
a recent 1.0 renovation, it has been challenged by other projects, including Apache Spark
and its own revised streaming framework.
• Heron was built from scratch with a container- and cluster-based design,
outlined in a research paper.
• The user creates Heron jobs, or "topologies," and submits them to a scheduling system,
which launches the topology in a series of containers.
• The scheduler can be any of a number of popular schedulers, like Apache Mesos or
Apache Aurora. Storm, by contrast, has to be manually provisioned on clusters to add
scale.
• In May 2016 Twitter released Heron under an open source license

Copyright © William El Kaim 2016 Source: Infoworld 133


Ingestion Technologies
Twitter Heron
• Heron is backward-compatible with Storm's API.
• Storm spouts and bolts could then be reused in Heron
• Gives existing Storm users some incentive to check out Heron.
• Heron
• Topology code is written in Java (or Scala)
• The web-based UI components are written in Python
• The critical parts of the framework, the code that manages the topologies and network
communications, are written in C++.
• Twitter claims it's been able to gain anywhere from two to five times an
improvement in "efficiency" (basically, lower opex and capex) with Heron.

Copyright © William El Kaim 2016 Source: Infoworld 134


Ingestion Technologies
Apache Spark
• Spark supports real-time distributed computation and stream-oriented
processing, but it's more of a general-purpose distributed computing
platform.
• In-memory data storage for very fast iterative processing
• Replacement for the MapReduce functions of Hadoop, running on top of an existing
Hadoop cluster, relying on YARN for resource scheduling.
• Spark can layer on top of Mesos for scheduling or run as a stand-alone cluster using its
built-in scheduler.
• Where Spark shines is in its support for multiple processing paradigms and the supporting libraries.

Copyright © William El Kaim 2016 135


Ingestion Technologies
Apache Spark

Source: Databricks

Source: Ippon

Copyright © William El Kaim 2016 136


Ingestion Technologies
Apache Spark
• Spark Core
• General execution engine for the Spark platform
• In-memory computing capabilities deliver speed
• General execution model supports wide variety of use cases
• Spark Streaming
• Run a streaming computation as a series of very small, deterministic batch jobs
• Batch size as low as ½ sec, latency of about 1 sec
• Exactly-once semantics
• Potential for combining batch and streaming processing in same system

Copyright © William El Kaim 2016 137
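A minimal DStream sketch of the micro-batch model described above: each 1-second batch is processed as a small deterministic job. The socket source on localhost:9999 is just a stand-in for a real stream such as Kafka.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-wordcount")
ssc = StreamingContext(sc, 1)                      # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # stand-in stream source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()          # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```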


Ingestion Technologies
Apache Spark
• At the core of Apache Spark is the notion of data abstraction as a distributed
collection of objects called a Resilient Distributed Dataset (RDD)
• RDDs allow you to write programs that transform these distributed datasets.
• RDDs are immutable, recomputable, and fault-tolerant distributed collections of objects
(partitions) spread across a cluster of machines
• Data can be stored in memory or on (local) disk.
• RDD enables parallel processing on data sets
• Data is partitioned across machines in a cluster that can be operated in parallel with a low-
level API that offers transformations and actions.
• RDDs are fault tolerant as they track data lineage information to rebuild lost data
automatically on failure.
• Contains transformation history (“lineage”) for whole data set
• Operations
• Stateless Transformations (map, filter, groupBy)
• Actions (count, collect, save)

Copyright © William El Kaim 2016 Source: Trivadis 138
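A tiny PySpark sketch of the RDD API named above: transformations (map, filter, and the like) lazily build the lineage, and actions (count, collect, save) trigger execution. The output path is hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")

numbers = sc.parallelize(range(1, 11), numSlices=4)   # an RDD in 4 partitions

# Transformations are lazy: they only record lineage.
evens   = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: (n, n * n))

# Actions trigger the actual distributed computation.
print(squares.count())        # 5
print(squares.collect())      # [(2, 4), (4, 16), (6, 36), (8, 64), (10, 100)]
squares.saveAsTextFile("hdfs:///tmp/squares")   # hypothetical output path
```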


Ingestion Technologies
Apache Spark
• DataFrame is an immutable distributed collection of data (like RDD)
• Unlike an RDD, data is organized into named columns, like a table in a relational
database.
• Designed to make large data sets processing even easier, DataFrame allows developers
to impose a structure onto a distributed collection of data, allowing higher-level
abstraction.
• It provides a domain-specific language API to manipulate your distributed data
• This makes Spark accessible to a wider audience, beyond specialized data engineers.
• Datasets
• Introduced in Spark 1.6, the goal of Spark Datasets is to provide an API that allows
users to easily express transformations on domain objects, while also providing the
performance and benefits of the robust Spark SQL execution engine.
• In Spark 2.0, the DataFrame APIs will merge with Datasets APIs, unifying
data processing capabilities across all libraries.

Copyright © William El Kaim 2016 139
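A short DataFrame sketch matching the description above (column names are made up): data is organized into named columns and manipulated through a higher-level, declarative API.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

people = spark.createDataFrame(
    [("alice", 34, "paris"), ("bob", 45, "london"), ("carol", 29, "paris")],
    ["name", "age", "city"],
)

# Named columns allow declarative, SQL-like manipulation.
(people.filter(F.col("age") > 30)
       .groupBy("city")
       .agg(F.avg("age").alias("avg_age"))
       .show())
```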


Ingestion Technologies
Apache Spark Machine Learning
• MLlib is Apache Spark general machine learning library
• Allows data scientists to focus on their data problems and models instead of solving the
complexities surrounding distributed data (such as infrastructure, configurations, etc.).
• The data engineers can focus on distributed systems engineering using Spark’s easy-to-
use APIs, while the data scientists can leverage the scale and speed of Spark core.
• ML Pipelines
• Running machine learning algorithms involves executing a sequence of tasks including
pre-processing, feature extraction, model fitting, and validation stages.
• High-Level API for MLlib that lives under the “spark.ml” package.
• A pipeline consists of a sequence of stages. There are two basic types of pipeline
stages: Transformer and Estimator.
• A Transformer takes a dataset as input and produces an augmented dataset as output.
• An Estimator must be first fit on the input dataset to produce a model, which is a Transformer
that transforms the input dataset.

Copyright © William El Kaim 2016 140
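A compact sketch of the Transformer/Estimator pipeline described above, using spark.ml on a made-up two-column dataset (text, label): Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator, and fitting the Pipeline produces a model that is itself a Transformer.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline").getOrCreate()

training = spark.createDataFrame(
    [("spark is great", 1.0), ("boring meeting notes", 0.0),
     ("hadoop and spark", 1.0), ("lunch menu today", 0.0)],
    ["text", "label"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")          # Transformer
hashing_tf = HashingTF(inputCol="words", outputCol="features")     # Transformer
lr = LogisticRegression(maxIter=10)                                # Estimator

model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training) # fit -> Transformer
model.transform(training).select("text", "prediction").show()
```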


Ingestion Technologies
Google Cloud Dataflow
• Fully-managed cloud service and
programming model for batch and
streaming big data processing.
• Used for developing and executing a
wide range of data processing patterns
including ETL, batch computation, and
continuous computation.
• Cloud Dataflow “frees” users from operational
tasks like resource management and
performance optimization.
• The open source Java-based Cloud
Dataflow SDK enables developers to
implement custom extensions and to
extend Dataflow to alternate service
environments

Copyright © William El Kaim 2016 Source: Google 141


Ingestion Technologies
Google Dataflow vs. Spark
• Dataflow is clearly faster than Spark.
• But Spark has an ace up its sleeve in the form of REPL, or its “read evaluate
print loop” functionality, which enables users to iterate on their problems
quickly and easily.
• “If you have a bunch of data scientists and you’re trying to figure out what they want to
do, and they need to play around a lot, then Spark may be a better solution for those
sorts of cases,” Oliver says.
• While Spark maintains an edge among data scientists looking to iterate
quickly, Google Cloud Dataflow seems to hold the advantage in the
operations department, thanks to all the work that Google has done over the
years to optimize queries at scale.
• “Google Cloud Dataflow has some key advantages, in particular if you have a well
thought out process that you’re trying to implement, and you’re trying to do it cost
effectively…then Google Cloud Dataflow is an excellent option for doing it at scale and
at a lower cost,” Oliver says.

Copyright © William El Kaim 2016 Source: Datanami 142


Ingestion Technologies
Apache Beam
• Apache Beam is an open source, unified programming model used to create
a data processing pipeline.
• Start by building a program that defines the pipeline using one of the open source Beam
SDKs.
• The pipeline is then executed by one of Beam’s supported distributed processing
back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.
• Beam is particularly useful for Embarrassingly Parallel data processing
tasks, in which the problem can be decomposed into many smaller bundles
of data that can be processed independently and in parallel.
• Beam can also be used for Extract, Transform, and Load (ETL) tasks and
pure data integration.

Copyright © William El Kaim 2016 143
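A minimal Beam pipeline sketch using the Python SDK: the in-memory input stands in for a real source, and the same code can be handed to another supported runner (Dataflow, Spark, Flink) via pipeline options.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:   # DirectRunner by default
    (pipeline
     | "Create" >> beam.Create(["big data", "data lake", "big hadoop"])
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```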


Ingestion Technologies
Concord

Copyright © William El Kaim 2016


http://concord.io/ 144
Ingestion Technologies
Apache Flink
• Flink’s core is a streaming dataflow engine that provides data distribution,
communication, and fault tolerance for distributed computations over data
streams.
• Flink includes several APIs for creating applications that use the Flink
engine:
• DataStream API for unbounded streams embedded in Java and Scala, and
• DataSet API for static data embedded in Java, Scala, and Python,
• Table API with a SQL-like expression language embedded in Java and Scala.
• Flink also bundles libraries for domain-specific use cases:
• CEP, a complex event processing library,
• Machine Learning library, and
• Gelly, a graph processing API and library.

Copyright © William El Kaim 2016 145


Ingestion Technologies
Apache Flink

Source: Ippon

Source: Apache
Copyright © William El Kaim 2016 146
Ingestion Technologies
Apache Flink Commercial Support

Copyright © William El Kaim 2016 Data Artisans 147


Ingestion Technologies
Spark vs. Flink
• Flink is:
• optimized for cyclic or iterative processes by
using iterative transformations on collections.
• This is achieved by an optimization of join
algorithms, operator chaining and reusing of
partitioning and sorting.
• However, Flink is also a strong tool for batch
processing.
• Spark is:
• based on resilient distributed datasets (RDDs).
• This (mostly) in-memory data structure gives
power to Spark's functional programming
paradigm. It is capable of big batch calculations
by pinning memory.
Source: Quora

Source: Zalando 148


Copyright © William El Kaim 2016
Ingestion Technologies
Apache APEX
• Apache Apex is a YARN-native integrated platform that unifies
stream and batch processing.
• It processes big data in motion in a way that is highly scalable, highly
performant, fault tolerant, stateful, secure, and distributed.
• Github
• Comparisons to others
• Spark and Storm are considered difficult to use. They're built on batch
engines, rather than a true streaming architecture, and don't natively
support stateful computation.
• They can't do the low-latency processing that Apex and Flink can, and will
suffer a latency overhead for having to schedule batches repeatedly, no
matter how quickly that occurs.
• Use cases
• GE’s Predix IoT cloud platform uses Apex for industrial data and
analytics
• Capital One for real-time decisions and fraud detection.

Copyright © William El Kaim 2016 Source: ASF 149


Ingestion Technologies
• Apache Samza
• Samza is a distributed stream-processing framework that is based on Apache Kafka and
YARN.
• It provides a simple callback-based API that’s similar to MapReduce, and it includes
snapshot management and fault tolerance in a durable and scalable way.
• Amazon Kinesis
• Kinesis is Amazon’s service for real-time processing of streaming data on the cloud.
• Deeply integrated with other Amazon services via connectors, such as S3, Redshift, and
DynamoDB, for a complete Big Data architecture.

Copyright © William El Kaim 2016 150
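As a small sketch of pushing events into Kinesis with the AWS boto3 SDK (the stream name and region are assumptions, and the stream must already exist):

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

event = {"user_id": "42", "action": "click", "page": "/home"}

# Records with the same partition key land on the same shard, preserving order.
kinesis.put_record(
    StreamName="clickstream",          # hypothetical, pre-created stream
    Data=json.dumps(event).encode(),
    PartitionKey=event["user_id"],
)
```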


Ingestion Technologies
• NFS Gateway
• The NFS Gateway supports NFSv3 and allows HDFS to be
mounted as part of the client’s local file system.
• Apache Sqoop
• Tool designed for efficiently transferring bulk data between Hadoop
and structured data stores (and vice-versa).
• Import data from external structured data stores into a HDFS
• Extract data from Hadoop and export it to external structured data
stores like relational database and enterprise data warehouses.

Copyright © William El Kaim 2016 151


Ingestion Technologies
Sqoop Example

Copyright © William El Kaim 2016


Source: Rubén Casado Tejedor 152
Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Storage Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 153
Storage Technology
Rise of Immutable Datastore

• In a relational database, files are mutable, which means a given cell can be
overwritten when there are changes to the data relevant to that cell.
• New architectures offer an accumulate-only file system that overwrites nothing.
Each file is immutable, and any changes are recorded as separate timestamped
files.
• The method lends itself not only to faster
and more capable stream processing, but
also to various kinds of historical time-
series analysis.

Copyright © William El Kaim 2016 Source: PWC 154
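A tiny, purely illustrative Python sketch of the accumulate-only idea (an assumption-laden toy, not any specific product): changes are appended as immutable, timestamped facts and the current state is derived by replaying the log.

import time

log = []  # append-only list of immutable, timestamped facts; nothing is overwritten

def record_change(key, value):
    # Every change becomes a new entry instead of updating a cell in place.
    log.append({"ts": time.time(), "key": key, "value": value})

def current_state():
    # Derive the latest value per key by replaying the log in order.
    state = {}
    for event in log:
        state[event["key"]] = event["value"]
    return state

record_change("customer-42", {"city": "Paris"})
record_change("customer-42", {"city": "London"})
print(current_state())  # latest view of each key
print(log)              # full history remains available for time-series analysis and audit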


Storage Technology
Why is immutability in big data stores significant?
• Fewer dependencies & Higher-volume data handling and improved site-response
capabilities
• Immutable files reduce dependencies or resource contention, which means one part of the system doesn’t
need to wait for another to do its thing. That’s a big deal for large, distributed systems that need to scale
and evolve quickly.
• More flexible reads and faster writes
• Writing data without structuring it beforehand means that you can have both fast reads and
writes, as well as more flexibility in how you view the data.
• Compatibility with Hadoop & log-based messaging protocols
• A popular method of distributed storage for less-structured data.
• Ex: Apache Samza and Apache Kafka are symbiotic and compatible with the Hadoop Distributed File
System (HDFS).
• Suitability for auditability and forensics
• Log-centric databases and the transactional logs of many traditional databases share a
common design approach that stresses consistency and durability (the C and D in ACID).
• But only the fully immutable shared log systems preserve the history that is most helpful for
audit trails and forensics.

Copyright © William El Kaim 2016 155


Storage Technologies: Databases
Evolutions

Copyright © William El Kaim 2016 Source: PWC 156


Storage Technologies: Cost & Speed

Copyright © William El Kaim 2016 157


Storage Technologies
• HDFS: Distributed FileSystem for Hadoop
• A Java-based filesystem that provides scalable and reliable data storage. Designed to
span large clusters of commodity servers.
• Master-Slaves Architecture (NameNode – DataNodes)
• NameNode: Manage the directory tree and regulates access to files by clients
• DataNodes: Store the data. Files are split into blocks of the same size and these blocks are
stored and replicated in a set of DataNodes
• Apache Hive
• An open-source data warehouse system for querying and analyzing large datasets
stored in Hadoop files.
• Abstraction layer on top of MapReduce
• SQL-like language called HiveQL.
• Metastore: Central repository of Hive metadata.

Copyright © William El Kaim 2016 158
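As an illustration only, the sketch below queries Hive from Python through HiveServer2 using the third-party PyHive client; the host, database, and table names are made up, and the query is ordinary HiveQL compiled to MapReduce (or Tez) jobs under the hood.

from pyhive import hive

# Connect to a (hypothetical) HiveServer2 endpoint.
conn = hive.Connection(host="hive.example.com", port=10000, database="default")
cursor = conn.cursor()

# HiveQL: SQL-like syntax over files stored in HDFS.
cursor.execute("""
    SELECT country, COUNT(*) AS nb_orders
    FROM orders
    GROUP BY country
    ORDER BY nb_orders DESC
    LIMIT 10
""")

for country, nb_orders in cursor.fetchall():
    print(country, nb_orders)

cursor.close()
conn.close()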


Storage Technologies
• Apache KUDU
• Kudu is an innovative new storage engine that is designed from the ground up to
overcome the limitations of various storage systems available today in the Hadoop
ecosystem.
• For the very first time, Kudu enables the use of the same storage engine for large scale
batch jobs and complex data processing jobs that require fast random access and
updates.
• As a result, applications that require both batch as well as real-time data processing
capabilities can use Kudu for both types of workloads.
• With Kudu’s ability to handle atomic updates, you no longer need to worry about
boundary conditions relating to late-arriving or out-of-sequence data.
• In fact, data with inconsistencies can be fixed in place in almost real time, without wasting time
deleting or refreshing large datasets.
• Having one system of record that is capable of handling fast data for both analytics and real-
time workloads greatly simplifies application design and implementation.

Copyright © William El Kaim 2016 159


Storage Technologies
• HBase
• An open source, non-relational, distributed column-oriented database written in Java.
• Modeled after Google’s BigTable and developed as part of Apache Hadoop project, it runs on
top of HDFS.
• Random, real-time read/write access to the data.
• Very light "schema"; rows are stored in sorted order.
• MapR DB
• An enterprise-grade, high-performance, in-Hadoop NoSQL database management
system used to add real-time operational analytics capabilities to Hadoop.
• Pivotal HDB
• Hadoop Native SQL Database powered by Apache HAWQ

Copyright © William El Kaim 2016 160
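To illustrate HBase's row-key / column-family model, here is a hedged sketch using the third-party happybase client over the HBase Thrift gateway; the host, table, and column names are assumptions.

import happybase

# Connect through a (hypothetical) HBase Thrift gateway.
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("web_events")

# Random, real-time write: the row key orders the data, values live in a column family.
table.put(b"user42|2016-12-01T10:00:00", {b"cf:page": b"/home", b"cf:status": b"200"})

# Random read by row key.
print(table.row(b"user42|2016-12-01T10:00:00"))

# Ordered scan over a key prefix (rows are stored in sorted order).
for key, data in table.scan(row_prefix=b"user42|"):
    print(key, data)

connection.close()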


Storage Technologies
• Apache Impala
• Open source MPP analytic database built to work with data stored on open, shared data
platforms like Apache Hadoop’s HDFS filesystem, Apache Kudu’s columnar storage, and
object stores like S3.
• By being able to query data from multiple sources stored in different, open formats like
Apache Parquet, Apache Avro, and text, Impala decouples data and compute and lets
users query data without having to move/load data specifically into Impala clusters.
• In the cloud, this capability is especially useful as you can create transient clusters with Impala
to run your reports/analytics and shut down the cluster when you are done or elastically scale
compute power to support peak demands, letting you save on cluster-hosting costs.
• Impala is designed to run efficiently on large datasets, and scales to hundreds of nodes
and hundreds of users.
• You can learn more about the unique use cases Impala on S3 delivers in this blog post.

Copyright © William El Kaim 2016 161


Storage Technologies
• MemSQL
• MemSQL unveiled its “Spark Streamliner” initiative, in which it incorporated Apache
Spark Streaming as a middleware component to buffer the parallel flow of data coming
from Kafka before it’s loaded into MemSQL’s consistent storage.
• This enabled customers like Pinterest to eliminate batch processing and move to continuous
processing of data.
• The exactly-once semantics is available through the “Create Pipeline” command in
MemSQL version 5.5.
• The command will automatically extract data from the Kafka source, perform some type of
transformation, and then load it into the MemSQL database’s lead nodes (as opposed to
loading them in MemSQL’s aggregator nodes first, as it did with Streamliner).
• The database can work on multiple, simultaneous streams while adhering to exactly-once
semantics.

Copyright © William El Kaim 2016 162


Storage Technology
Technology Landscape

Copyright © William El Kaim 2016 Source: Octo Technology 163


Storage Technology: Encoding Format
• A huge bottleneck for HDFS-enabled applications like MapReduce and
Spark is the time it takes to find relevant data in a particular location and the
time it takes to write the data back to another location.
• These issues are exacerbated by the difficulties of managing large datasets, such as
evolving schemas, or storage constraints.
• Choosing an appropriate file format can have some significant benefits:
• Faster read times
• Faster write times
• Splittable files (so you don’t need to read the whole file, just a part of it)
• Schema evolution support (allowing you to change the fields in a dataset)
• Advanced compression support (compress the files with a compression codec without
sacrificing these features)

Copyright © William El Kaim 2016 Source: Matthew Rathbone 164


Storage Technology: Encoding Format
• The format of the files you can store on HDFS, like any filesystem, is entirely
up to you.
• However unlike a regular file system, HDFS is best used in conjunction with a data
processing toolchain like MapReduce or Spark.
• These processing systems typically (although not always) operate on some form of
textual data like webpage content, server logs, or location data.
• If you’re just getting started with Hadoop, HDFS, Hive and wondering what
file format you should be using to begin with, then use tab delimited files
for your prototyping (and first production jobs).
• They’re easy to debug (because you can read them), they are the default format of Hive,
and they’re easy to create and reason about.
• Once you have a production MapReduce or Spark job regularly generating data, come
back and pick something better.

Copyright © William El Kaim 2016 Source: Matthew Rathbone 165


Encoding Format: Text Files (E.G. CSV,
TSV)
• Data is laid out in lines, with each line being a record. Lines are terminated
by a newline character \n in the typical Unix fashion.
• Text-files are inherently splittable (just split on \n characters!), but if you want
to compress them you’ll have to use a file-level compression codec that
supports splitting, such as BZIP2.
• Because these files are just text files you can encode anything you like in a
line of the file.
• One common example is to make each line a JSON document to add some structure.
While this can waste space with needless column headers, it is a simple way to start
using structured data in HDFS.

Copyright © William El Kaim 2016 Source: Matthew Rathbone 166
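A small plain-Python sketch of the "one JSON document per line" pattern described above (file name and fields are invented):

import json

records = [
    {"user": "alice", "page": "/home", "status": 200},
    {"user": "bob", "page": "/cart", "status": 404},
]

# One JSON document per line; the \n record terminator keeps the file splittable.
with open("events.jsonl", "w") as out:
    for record in records:
        out.write(json.dumps(record) + "\n")

# Any line-oriented tool (MapReduce, Spark, grep, ...) can consume it line by line.
with open("events.jsonl") as src:
    for line in src:
        print(json.loads(line)["user"])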


Encoding Format: Sequence Files
• Sequence files were originally designed for MapReduce.
• They encode a key and a value for each record and nothing more.
• Records are stored in a binary format that is smaller than a text-based format would be.
• Like text files, the format does not encode the structure of the keys and values, so if you make
schema migrations they must be additive.
• Sequence files by default use Hadoop’s Writable interface in order to figure out
how to serialize and de-serialize classes to the file.
• Typically if you need to store complex data in a sequence file you do so in the value part while
encoding the id in the key. The problem with this is that if you add or change fields in your
Writable class it will not be backwards compatible with the data stored in the sequence file.
• One benefit of sequence files is that they support block-level compression, so
you can compress the contents of the file while also maintaining the ability to split
the file into segments for multiple map tasks.
• Sequence files are well supported across Hadoop and many other HDFS
enabled projects, and I think represent the easiest next step away from text files.
• More : Apache Hadoop SequenceFile wiki
Copyright © William El Kaim 2016 Source: Matthew Rathbone 167
Encoding Format: AVRO
• Avro is not really a file format, it’s a file format plus a serialization and deserialization
framework.
• Encodes the schema of its contents directly in the file, which allows complex objects to be stored natively.
• Avro provides:
• Rich data structures.
• A compact, fast, binary data format.
• A container file, to store persistent data.
• Remote procedure call (RPC).
• Simple integration with dynamic languages. Code generation is not required to read or write data files
nor to use or implement RPC protocols. Code generation is an optional optimization, only worth
implementing for statically typed languages.
• Avro defines file data schemas in JSON (for interoperability), allows for schema evolutions
(remove a column, add a column), and multiple serialization/deserialization use cases.
• Avro supports block-level compression.
• For most Hadoop-based use cases Avro is a really good choice.
• More: Apache Avro web site

Copyright © William El Kaim 2016 Source: Matthew Rathbone 168
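For illustration, a minimal sketch of Avro's schema-in-the-file approach using the third-party fastavro library (the schema, field names, and file name are invented):

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "page", "type": "string"},
        # Adding a field with a default later on is a typical schema evolution.
        {"name": "status", "type": "int", "default": 200},
    ],
})

records = [
    {"user": "alice", "page": "/home", "status": 200},
    {"user": "bob", "page": "/cart", "status": 404},
]

# The schema travels inside the container file, next to the binary-encoded records.
with open("clicks.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")  # block-level compression

with open("clicks.avro", "rb") as src:
    for record in reader(src):
        print(record)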


Encoding Format: Columnar File Formats
• The latest evolution concerning file formats for Hadoop is columnar file storage.
• Basically this means that instead of just storing rows of data adjacent to one another you
also store column values adjacent to each other.
• So datasets are partitioned both horizontally and vertically. This is particularly useful if your data
processing framework just needs access to a subset of data that is stored on disk as it can access all
values of a single column very quickly without reading whole records.
• One huge benefit of columnar oriented file formats is that data in the same column tends to
be compressed together which can yield some massive storage optimizations (as data in
the same column tends to be similar).
• If you’re chopping and cutting up datasets regularly then these formats can be very
beneficial to the speed of your application.
• If you have an application that usually needs entire rows of data then the columnar formats may
actually be a detriment to performance due to the increased network activity required.
• Overall these formats can drastically optimize workloads, especially for Hive and Spark
which tend to just read segments of records rather than the whole thing (which is more
common in MapReduce).
• Two main file formats:
• Apache Parquet, which seems to have the most community support.
• Apache ORC, a successor to the earlier RCFile format.

Copyright © William El Kaim 2016 Source: Matthew Rathbone 169
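To make the column-pruning benefit tangible, here is a hedged PySpark sketch (paths and column names are invented) that writes a DataFrame as Parquet and reads back a single column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "/home", 200), ("bob", "/cart", 404)],
    ["user", "page", "status"],
)

# Columnar layout: values of each column are stored (and compressed) together.
df.write.mode("overwrite").parquet("hdfs:///data/clicks_parquet")

# Only the 'user' column is materialized; the other columns are skipped on disk.
spark.read.parquet("hdfs:///data/clicks_parquet").select("user").show()

spark.stop()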


Storage Technology: Encoding Format

Copyright © William El Kaim 2016 170


Two Ways To Compress Data In Hadoop
• File-Level Compression
• Compresses entire files regardless of the file format, the same way you would compress a
file in Linux. Some of these formats are splittable (e.g. bzip2, or LZO if indexed).
• Block-Level Compression
• Internal to the file format, so individual blocks of data within the file are compressed.
• This means that the file remains splittable even if you use a non-splittable compression
codec.
• Snappy is a great balance of speed and compression ratio.

Copyright © William El Kaim 2016 Source: Matthew Rathbone 171
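Two hedged PySpark examples of the options above (paths are invented; the codec class is the standard Hadoop bzip2 codec): file-level compression of text output, and block-level Snappy compression inside Parquet files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compression-demo").getOrCreate()
sc = spark.sparkContext

# File-level compression: each output part file becomes a .bz2 text file (bzip2 stays splittable).
sc.parallelize(["a,1", "b,2", "c,3"]).saveAsTextFile(
    "hdfs:///data/out_text_bz2",
    compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec",
)

# Block-level compression: Snappy-compressed blocks inside the Parquet container,
# so the file remains splittable for parallel processing.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
df.write.option("compression", "snappy").parquet("hdfs:///data/out_parquet_snappy")

spark.stop()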


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Processing Paradigms & Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 172
Hadoop Processing Paradigms

Batch processing
• Large amount of static data
• Generally incurs high latency
• Volume

Real-time processing
• Computes streaming data
• Low latency
• Velocity

Hybrid computation
• Lambda Architecture
• Volume + Velocity

Copyright © William El Kaim 2016 Source: Rubén Casado & Cloudera 173
Hadoop Processing Paradigms & Time

Copyright © William El Kaim 2016 174


Hadoop Batch processing

• Scalable
• Large amount of static data
• Distributed
• Parallel
• Fault tolerant
• High latency
Volume

Copyright © William El Kaim 2016 Source: Rubén Casado 175


Hadoop – Batch Processing - Map Reduce
• MapReduce was designed by
Google as a programming model for
processing large data sets with a
parallel, distributed algorithm on a
cluster.
• Key Terminology
• Job: A “full program” - an execution of a
Mapper and Reducer across a data set
• Task: An execution of a Mapper or a
Reducer on a slice of data – a.k.a. Task-
In-Progress (TIP)
• Task Attempt: A particular instance of an
attempt to execute a task on a machine

Copyright © William El Kaim 2016 176


Copyright © William El Kaim 2016 Source: Hadooper 177
Hadoop – Batch Processing - Map Reduce
• Processing can occur on data stored either in a filesystem (unstructured) or
in a database (structured).
• MapReduce can take advantage of the locality of data, processing it near the
place it is stored in order to reduce the distance over which it must be
transmitted.
• "Map" step
• Each worker node applies the "map()" function to the local data, and writes the output to a
temporary storage.
• A master node ensures that only one copy of redundant input data is processed.
• "Shuffle" step
• Worker nodes redistribute data based on the output keys (produced by the "map()" function),
such that all data belonging to one key is located on the same worker node.
• "Reduce" step
• Worker nodes now process each group of output data, per key, in parallel.

Copyright © William El Kaim 2016 178
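The classic word-count example makes the Map / Shuffle / Reduce steps above concrete. The sketch below is a local, self-contained Python simulation in the spirit of Hadoop Streaming scripts; on a real cluster the framework would run the mapper on each data block and perform the shuffle/sort between the two phases.

from itertools import groupby

def mapper(lines):
    # "Map" step: emit a (word, 1) pair for every word in the local input split.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def shuffle(pairs):
    # "Shuffle" step (done by the framework on a cluster): group pairs by key.
    return groupby(sorted(pairs), key=lambda kv: kv[0])

def reducer(grouped):
    # "Reduce" step: aggregate all values of a given key, in parallel per key.
    for word, group in grouped:
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data on hadoop", "hadoop map reduce on big data"]
    for word, total in reducer(shuffle(mapper(text))):
        print(word, total)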


Hadoop – Batch Processing - Map Reduce

Copyright © William El Kaim 2016 179


Batch Processing Technologies

Copyright © William El Kaim 2016 Source: Rubén Casado 180


Batch Processing Architecture Example

Copyright © William El Kaim 2016 Source: Helena Edelson 181


Real-time Processing

• Low latency
• Continuous unbounded streams of data
• Distributed
• Parallel
• Fault-tolerant
Velocity

Copyright © William El Kaim 2016 Source: Rubén Casado 182


Real-time Processing Technologies

Copyright © William El Kaim 2016 Source: Rubén Casado 183


Real-time (Stream) Processing
• Computational model and Infrastructure for continuous data processing, with
the ability to produce low-latency results
• Data collected continuously is naturally processed continuously (Event Processing or
Complex Event Processing -CEP)
• Stream processing and real-time analytics are increasingly becoming where
the action is in the big data space.
• As real-time streaming architectures like Kafka continue to gain steam, companies that
are building next-generation applications upon them will debate the merits of the unified
and the federated approaches

Copyright © William El Kaim 2016 Source: Trivadis 184


Real-time (Stream) Processing

Copyright © William El Kaim 2016 Source: Trivadis 185


Real-time (Stream) Processing Arch.
Pattern

Copyright © William El Kaim 2016 Source: Cloudera 186


Real-time (Stream) Processing
• (Event-) Stream Processing
• A one-at-a-time processing model
• A datum is processed as it arrives
• Sub-second latency
• Difficult to process state data efficiently
• Micro-Batching
• A special case of batch processing with very small batch sizes (tiny)
• A nice mix between batching and streaming
• At cost of latency
• Gives stateful computation, making windowing an easy task

Copyright © William El Kaim 2016 Source: Trivadis 187
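As a sketch of the micro-batching model, the minimal PySpark Streaming job below groups a socket stream into 5-second batches and counts words per batch; the host and port are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "micro-batch-demo")
ssc = StreamingContext(sc, batchDuration=5)  # each micro-batch covers 5 seconds

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source

# The whole pipeline is recomputed once per micro-batch, trading a little latency
# for simple, stateful (windowable) computation.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()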


Hybrid Computation Model
• Low latency
• Massive data + Streaming data
• Scalable
• Combine batch and real-time results

Volume Velocity

Copyright © William El Kaim 2016 Source: Rubén Casado 188


Hybrid Computation: Lambda Architecture
• Data-processing architecture designed to handle massive quantities of data
by taking advantage of both batch- and stream-processing methods.
• A system consisting of three layers: batch processing, speed (or real-time)
processing, and a serving layer for responding to queries.
• This approach to architecture attempts to balance latency, throughput, and fault-
tolerance by using batch processing to provide comprehensive and accurate views of
batch data, while simultaneously using real-time stream processing to provide views of
online data.
• The two view outputs may be joined before presentation.
• Lambda Architecture case stories via lambda-architecture.net

Copyright © William El Kaim 2016 Source: Kreps 189


Hybrid Computation: Lambda Architecture
• Batch layer
• Receives arriving data, combines it with historical data and recomputes results by
iterating over the entire combined data set.
• The batch layer has two major tasks:
• managing historical data; and recomputing results such as machine learning models.
• Operates on the full data and thus allows the system to produce the most accurate
results. However, the results come at the cost of high latency due to high computation
time.
• The speed layer
• Is used in order to provide results in a low-latency, near real-time fashion.
• Receives the arriving data and performs incremental updates to the batch layer results.
• Thanks to the incremental algorithms implemented at the speed layer, computation cost is
significantly reduced.
• The serving layer enables various queries of the results sent from the batch
and speed layers.

Copyright © William El Kaim 2016 Source: Kreps 190
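A toy Python sketch of the three layers just described (purely illustrative, not tied to any product): the batch layer periodically recomputes a complete view over the whole master dataset, the speed layer keeps low-latency incremental counts, and the serving layer merges both at query time.

from collections import Counter

master_dataset = []        # immutable, append-only record of all events
batch_view = Counter()     # batch layer output: complete and accurate, but high latency
realtime_view = Counter()  # speed layer output: incremental and immediately available

def ingest(event):
    master_dataset.append(event)       # kept forever so results can always be recomputed
    realtime_view[event["page"]] += 1  # incremental update in the speed layer

def run_batch_layer():
    # Full recomputation over the entire master dataset (slow but authoritative).
    global batch_view
    batch_view = Counter(e["page"] for e in master_dataset)
    realtime_view.clear()              # those increments are now covered by the batch view

def serve(page):
    # Serving layer: join the batch and speed views to answer a query.
    return batch_view[page] + realtime_view[page]

ingest({"page": "/home"})
ingest({"page": "/home"})
print(serve("/home"))   # 2, answered by the speed layer before any batch run
run_batch_layer()
print(serve("/home"))   # still 2, now answered by the batch view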


Hybrid computation: Lambda Architecture

Copyright © William El Kaim 2016 Source: Mapr 191


Hybrid computation: Lambda Architecture
DataTorrent
• DataTorrent RTS Core
• Open source enterprise-grade unified
stream and batch platform
• High performing, fault tolerant, scalable,
Hadoop-native in-memory platform
• Supports Kafka, HDFS, AWS S3n, NFS,
(s)FTP, JMS
• dtManage - DataTorrent Management
Console
• Hadoop-integrated application that
provides an intuitive graphical
management interface for Devops teams
• manage, monitor, update and troubleshoot
the DataTorrent RTS system and
applications
Copyright © William El Kaim 2016 Source: DataTorrent 192
Ex: Novelti.io (ex. Lambdoop)
Batch

Hybrid

Real-Time

Copyright © William El Kaim 2016 Source: Novelti.io 193


Ex: Lambda Architecture

Copyright © William El Kaim 2016 Source: Datastax 194


Ex: Lambda Architecture Stacks

Copyright © William El Kaim 2016 Source: Helena Edelson 195


Different Streaming Architecture Vision
• Hadoop major distributors have different views on how streaming fits into
traditional Hadoop architectures.
• Hortonworks has taken a data plane approach (with HDP)
• that seeks to virtually connect multiple data repositories in a federated manner
• to unify the security and governance of data existing in different places (on- and off-
premise data lakes like HDP and streaming data platforms like HDF).
• Specifically, it’s building hooks between Apache Atlas (the data governance component)
and Apache Knox (the security tool) that give customers a single view of their data.
• MapR is going all-in on the converged approach that stresses the
importance of a single unified data repository.
• Cloudera, meanwhile, sits somewhere in the middle (although it’s probably
closer to MapR).

Copyright © William El Kaim 2016 Source: Datanami 196


Ex: Lambda Architecture Cloudera Vision
• Kafka as a piece of a larger real-time or near real-time architecture
• Combination of Kafka and Spark Streaming for the so-called speed layer.
• In conjunction with a batch layer, leading to the use of the lambda architecture
• Because people want to operate with a larger history of events
• Kudu project as the real optimized store for Lambda architectures because
• Kudu offers a happy medium between the scan performance of HDFS and the record-
level updating capability of HBase.
• It enables real-time response to single events and can be the speed layer and batch
layer for a single store

Copyright © William El Kaim 2016 Source: Datanami 197


Hybrid computation: Kappa Architecture
• Proposal from Jay Kreps (LinkedIn) in this article.
• Then the talk “Turning the database inside out with Apache Samza” by Martin Kleppmann
• Main objective
• Avoid maintaining two separate code bases for the batch and speed layers (lambda).
• Key benefits
• Handle both real-time data processing and continuous data reprocessing using a single
stream processing engine.
• Data reprocessing is an important requirement for making visible the effects of code
changes on the results.

Copyright © William El Kaim 2016 Source: Kreps 198


Hybrid computation: Kappa Architecture
• Architecture is composed of only two layers:
• The stream processing layer runs the stream processing jobs.
• Normally, a single stream processing job is run to enable real-time data processing.
• Data reprocessing is only done when some code of the stream processing job needs to be
modified.
• This is achieved by running another, modified stream processing job and replaying all previous data.
• The serving layer is used to query the results (like the Lambda architecture).

Copyright © William El Kaim 2016 Source: O’Reilly 199


Hybrid computation: Kappa Architecture
• Intrinsically, there are four main principles in the Kappa architecture:
• Everything is a stream: Batch operations become a subset of streaming operations.
Hence, everything can be treated as a stream.
• Immutable data sources: Raw data (data source) is persisted and views are derived, but
a state can always be recomputed as the initial record is never changed.
• Single analytics framework: Keep it short and simple (KISS) principle. A single analytics
engine is required. Code, maintenance, and upgrades are considerably reduced.
• Replay functionality: Computations and results can evolve by replaying the historical
data from a stream.
• Data pipeline must guarantee that events stay in order from generation to ingestion. This is
critical to guarantee consistency of results, as this guarantees deterministic computation
results. Running the same data twice through a computation must produce the same result

Copyright © William El Kaim 2016 Source: MapR 200


Hybrid computation: Kappa Architecture
• Use Kafka or some other system that will let you retain the full log of the data
you want to be able to reprocess and that allows for multiple subscribers.
• For example, if you want to reprocess up to 30 days of data, set your retention in Kafka
to 30 days.
• When you want to do the reprocessing, start a second instance of your
stream processing job that starts processing from the beginning of the
retained data, but direct this output data to a new output table.
• When the second job has caught up, switch the application to read from the
new table.
• Stop the old version of the job, and delete the old output table.

Copyright © William El Kaim 2016 Source: Kreps 201
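A hedged sketch of this reprocessing recipe using the kafka-python client (topic, broker address, and output table names are placeholders): the second job starts from the beginning of the retained log under a new consumer group and writes to a new output table, which the application switches to once the job has caught up.

from kafka import KafkaConsumer

# Second instance of the stream processing job, started only for reprocessing.
consumer = KafkaConsumer(
    "events",                                    # topic retained for e.g. 30 days
    bootstrap_servers=["kafka.example.com:9092"],
    auto_offset_reset="earliest",                # replay from the start of the retained log
    group_id="page-counts-v2",                   # new group = offsets independent of the v1 job
)

def process(raw_value):
    # The modified (v2) processing logic.
    return raw_value.decode("utf-8").upper()

for message in consumer:
    result = process(message.value)
    # Write to a NEW output table (e.g. page_counts_v2). Once this job has caught up,
    # switch the application to the new table, then stop the old job and drop its table.
    print("page_counts_v2 <-", result)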


Kappa Architecture Example

Copyright © William El Kaim 2016 Source: Trivadis 202


Hybrid computation: Lambda vs. Kappa

Lambda: used to derive value from all data in a single processing chain

Kappa: used to provide the freshest data to customers

Source: Kreps

Copyright © William El Kaim 2016 203


Hybrid computation: Lambda vs. Kappa
Lambda

Kappa

Source: Ericsson

Copyright © William El Kaim 2016 204


Hadoop Processing Paradigms Evolutions

Source: Rubén Casado Tejedor


Copyright © William El Kaim 2016 205
Processing Technologies
• MapReduce
• A programming model and an associated implementation for processing and generating
large data sets with a parallel, distributed algorithm on a cluster.
• Apache Hive
• Provides a mechanism to project structure onto large data sets and to query the data
using a SQL-like language called HiveQL.
• Apache Spark
• An open-source engine developed specifically for handling large-scale data processing
and analytics.
• Apache Storm
• A system for processing streaming data in real time that adds reliable real-time data
processing capabilities to Enterprise Hadoop.

Copyright © William El Kaim 2016 206


Processing Technologies
• Apache Drill
• Called the Omni-SQL: Schema-free SQL Query Engine for Hadoop,
NoSQL and Cloud Storage
• An open-source software framework that supports data intensive
distributed applications for interactive analysis of large-scale datasets
• Apache Pig
• Platform for analyzing large data sets
• High-level procedural language for expressing data analysis programs.
• Pig Latin: Data flow programming language.
• Cascading
• Cascading is a data processing API and processing query planner used
for defining, sharing, and executing data-processing workflows
• Eases development of complex Hadoop MapReduce workflows
• In the same way as Pig
Source: Dataiku

Copyright © William El Kaim 2016 207


Processing Technologies
• Apache Drill is an engine that can connect to many different data sources,
and provide a SQL interface to them.
• standard data sources that you'd be able to query with SQL, like Oracle or MySQL
• can also work with flat files such as CSV or JSON
• as well as Avro and Parquet formats.
• Its ability to run SQL against files is a great feature.
• Example of how to use Drill here.
• Apache OMID
• Contributed to ASF by Yahoo
• Omid provides a high-performance ACID transactional framework with Snapshot Isolation
guarantees on top of HBase, being able to scale to thousands of clients triggering
transactions on application data.
• It’s one of the few open-source transactional frameworks that can scale beyond 100K
transactions per second on mid-range hardware while incurring minimal impact on the
latency accessing the datastore.

Copyright © William El Kaim 2016 208


Stream Processing: HortonWorks
Dataflow
• Hortonworks DataFlow is an integrated platform to collect, conduct and
curate real-time data, moving it from any source to any destination.

Copyright © William El Kaim 2016 Source: HortonWorks 209


Stream Processing: HortonWorks
Dataflow

Copyright © William El Kaim 2016 Source: HortonWorks 210


Streaming PaaS: StreamTools

Copyright © William El Kaim 2016 http://blog.nytlabs.com/streamtools/ 211


Streaming PaaS: Striim

Copyright © William El Kaim 2016 http://www.striim.com/ 212


Streaming PaaS: StreamAnalytix

Copyright © William El Kaim 2016 http://streamanalytix.com/ 213


Streaming PaaS: InsightEdge

Copyright © William El Kaim 2016 http://insightedge.io/ 214


More Information
• The Hadoop Ecosystem Table
• Big Data Ingestion and Streaming Tools
• Apache Storm vs. Spark Streaming
• Data Science & Data Discovery Platforms Compared. Datameer and Dataiku
DSS go head to head
• Applying the Kappa architecture in the telco industry

Copyright © William El Kaim 2016 215


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Big Data Fabric
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 216
Big Data Fabric
Introduction
• Definition:
• Bringing together disparate big data sources automatically, intelligently, and securely,
and processing them in a big data platform technology, such as Hadoop and Apache
Spark, to deliver a unified, trusted, and comprehensive view of customer and business
data.
• Big data fabric focuses on automating the process of ingestion, curation, and
integrating big data sources to deliver intelligent insights that are critical for
businesses to succeed.
• The platform minimizes complexity by automating processes, generating big data
technology and platform code automatically, and integrating workflows to simplify the
deployment.
• Big data fabric is not just about Hadoop or Spark — it comprises several
components, all of which must work in tandem to deliver a flexible,
integrated, secure, and scalable platform.
• Big data fabric architecture has six core layers

Copyright © William El Kaim 2016 Source: Forrester 217


Big Data Fabric
Architecture

Source: Eckerson Group

Source: Forrester

Copyright © William El Kaim 2016 218


Big Data Fabric
Six core Architecture Layers

• Data ingestion layer.
• The data ingestion layer deals with getting the big data sources connected, ingested,
streamed, and moved into the data fabric.
• Big data can come from devices, sensors, logs, clickstreams, databases, applications, and
various cloud sources, in the form of structured or unstructured data.
• Processing and persistence layer.
• This layer uses Hadoop, Spark, and other Hadoop ecosystem components such as Kafka,
Flume, and Hive to process and persist big data for use within the big data fabric framework.
• Orchestration layer.
• The orchestration layer is a critical layer of the big data fabric that transforms, integrates, and
cleans data to support various use cases in real time or near real time.
• It can transform data inside Hadoop to enable integration, or it can match and clean data
dynamically.
• Data discovery layer.
• This layer automates the discovery of new internal or external big data sources and
presents them as a new data asset for consumption by business users.
• Dynamic discovery includes several components such as data modeling, data preparation,
curation, and virtualization to deliver a flexible big data platform to support any use case.
• Data management and intelligence layer.
• This layer enables end-to-end data management capabilities that are essential to ensuring the
reliability, security, integration, and governance of data.
• Its components include data security, governance, metadata management, search,
data quality, and lineage.
• Data access layer.
• This layer includes caching and in-memory technologies, self-service capabilities and
interactions, and fabric components that can be embedded in analytical solutions, tools, and
dashboards.

Copyright © William El Kaim 2016 Source: Forrester 219


Big Data Fabric Adoption Is In Its Infancy
• Most enterprises that have a big data fabric platform are building it
themselves by integrating various core open source technologies
• In addition, they are supporting the platform with commercial products for
data integration, security, governance, machine learning, SQL-on-Hadoop,
and data preparation technologies.
• However, organizations are realizing that creating a custom technology stack
to support a big data fabric implementation (and then customizing it to meet
business requirements) requires significant time and effort.
• Solutions are starting to emerge from vendors.

Copyright © William El Kaim 2016 Source: Forrester 220


Make Big Data Fabric Part Of Your Big Data Strategy!

• Enterprise architects whose companies are pursuing a big data strategy can
benefit from a big data fabric implementation that automates, secures,
integrates, and curates big data sources intelligently.
• Your big data fabric strategy should:
• Integrate only a few big data sources at first.
• Start top-down rather than bottom-up, keeping the end in mind.
• Separate analytics from data management. Analytics tools should focus primarily on
data visualization and advanced statistical/data mining algorithms with limited
dependence on data management functions. Decoupling data management from data
analytics reduces the time and effort needed to deliver trusted analytics.
• Create a team of experts to ensure success.
• Use automation and machine learning to accelerate deployment.

Copyright © William El Kaim 2016 Source: Forrester 221


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Geo-Spatial-on-Hadoop
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 222
Geo-Spatial-on-Hadoop

• ESRI
• ESRI for Big Data
• Esri GIS tools for Hadoop: Toolkit allowing developers to build analytical
tools leveraging both Hadoop and ArcGIS.
• Esri User Defined Functions built on top of the Esri Geometry API
• Pigeon: spatial extension to Pig that allows it to process spatial data.
• Hive Spatial Query: adds spatial geometric user-defined functions (UDFs) to Hive.
• Geomesa
• GeoMesa is an open-source, distributed, spatio-temporal database
built on Accumulo, HBase, Cassandra, and Kafka.
• SpatialHadoop
• Open source MapReduce extension designed specifically to handle huge
datasets of spatial data on Apache Hadoop.
• SpatialHadoop is shipped with built-in high level language, spatial
data types, spatial indexes and efficient spatial operations.

Copyright © William El Kaim 2016 223


Geo-Spatial-on-Hadoop

• GeoDataViz
• CartoDB
• Deep Insights technology is capable of handling and visualizing massive amounts
of contextual and time based location data.
• Spatialytics
• Standard geoBI platform
• mapD
• Leverage GPU and a dedicated NoSQL database for better performance
• deck.gl (Uber)
• WebGL-powered framework for visual exploratory data analysis of large datasets.
• Data Converter
• ESRI GeoJSon Utils
• GDAL: Geospatial Data Abstraction Library
• Redis
• Open source (BSD licensed), in-memory data structure store, used as
database, cache and message broker. It supports data structures such
as strings, hashes, lists, sets, sorted sets with range queries, bitmaps,
hyperloglogs and geospatial indexes with radius queries.
• Tutorial / Examples
• How To Analyze Geolocation Data with Hive and Hadoop – Uber trips
• Geo spatial data support for Hive using Taxi data in NYC
• ESRI Wiki
• GDAL: Geospatial Data Abstraction
Library

Copyright © William El Kaim 2016 224


Deck.GL

Copyright © William El Kaim 2016 http://uber.github.io/deck.gl/#/ 225


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 226
Hadoop: Open Source Bazaar Style Dev.
• Hadoop was first conceived at Yahoo as a distributed file system (HDFS)
and a processing framework (MapReduce) for indexing the Internet.
• It worked so well that other Internet firms in the Silicon Valley started using
the open source software too.
• Apache Hadoop, by all accounts, has been a huge success on the open
source front.
• The Hadoop project has spawned dozens of related Apache projects
• Hive, Impala, Spark, HBase, Cassandra, Pig, Tez, etc.

Copyright © William El Kaim 2016 227


Is there an Hadoop Standard?
Apache Software Foundation Hadoop
• Apache Software Foundation (ASF) is managing
Apache Hadoop
• The Apache Hadoop software library is a
framework that allows for the distributed
processing of large data sets across clusters of
computers using simple programming models.
• It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.
• Rather than rely on hardware to deliver high-
availability, the library itself is designed to detect
and handle failures at the application layer, so
delivering a highly-available service on top of a
cluster of computers, each of which may be prone
to failures.
• http://hadoop.apache.org/
Source: Apache Software Foundation
Copyright © William El Kaim 2016 228
Is there an Hadoop Standard?
Open Data Platform Initiative
• ODPi defines itself as "a shared industry effort focused on promoting and
advancing the state of Apache Hadoop and big data technologies for the
enterprise."
• The group has grown its membership steadily since launching in February
2015 under the name Open Data Platform Alliance :
• Ampool, Altiscale, Capgemini, DataTorrent, EMC, GE, Hortonworks, IBM, Infosys,
Linaro, NEC, Pivotal, PLDT, SAS Institute Inc, Splunk, Squid Solutions, SyncSort,
Telstra, Teradata, Toshiba, UNIFi, Verizon, VMware, WANdisco, Xiilab, zData and
Zettaset.
• ODPi takes a major step forward by securing official endorsement by the Linux
Foundation turning it into a Linux Foundation collaborative project.
• Major companies against ODPi are Amazon, Cloudera, and MapR
• Specifications
• ODPi runtime specification (march 2016) and ODPI Operations

Source: Odpi
Copyright © William El Kaim 2016 229
Is there an Hadoop Standard?
Open Data Platform Initiative
• Objectives are :
• Reinforces the role of the Apache Software Foundation (ASF) in the
development and governance of upstream projects.
• Accelerates the delivery of Big Data solutions by providing a well-
defined core platform to target.
• Defines, integrates, tests, and certifies a standard "ODPi Core" of
compatible versions of select Big Data open source projects.
• Provides a stable base against which Big Data solution providers can
qualify solutions.
• Produces a set of tools and methods that enable members to create
and test differentiated offerings based on the ODPi Core.
• Contributes to ASF projects in accordance with ASF processes and
Intellectual Property guidelines.
• Supports community development and outreach activities that
accelerate the rollout of modern data architectures that leverage
Apache Hadoop®.
• Will help minimize the fragmentation and duplication of effort within
the industry.

Source: Odpi
Copyright © William El Kaim 2016 230
Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop V1
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 231
Hadoop V1: Integration Options
Two integration options connect the existing infrastructure (logs & files, databases & warehouses,
applications & spreadsheets, visualization & intelligence tools) to the Hadoop V1 stack
(HCatalog, Pig, Hive and HBase on top of HDFS and MapReduce):
• Near Real-Time Integration
• Batch & Scheduled Integration
Ingestion and access paths shown include Sqoop, Flume, WebHDFS, data integration tools
(Talend, Informatica), ODBC/JDBC and REST.

Source: HortonWorks
Copyright © William El Kaim 2016 232
http://hadooper.blogspot.fr/ 233
Copyright © William El Kaim 2016
Hadoop V1: Technology Elements
• Hive - A data warehouse infrastructure that runs on top of Hadoop. Hive supports SQL
queries, star schemas, partitioning, join optimizations, caching of data, etc.
• Pig - A scripting language for processing Hadoop data in parallel.
• MapReduce - Java applications that can process data in parallel.
• Ambari - An open source management interface for installing, monitoring and managing a
Hadoop cluster. Ambari has also been selected as the management interface for OpenStack.
• HBase - A NoSQL columnar database providing extremely fast scanning of column data for
analytics.
• Sqoop, Flume - Tools providing large data ingestion for Hadoop using SQL, streaming and
REST API interfaces.
• Oozie - A workflow manager and scheduler.
• Zookeeper - A coordination infrastructure.
• Mahout - A machine learning library supporting recommendation, clustering, classification
and frequent itemset mining.
• Hue - A Web interface that contains a file browser for HDFS, a Job Browser for YARN, an
HBase Browser, Query Editors for Hive, Pig and Sqoop, and a Zookeeper browser.

Copyright © William El Kaim 2016 234


Hadoop V1: Technology Elements

Source: Octo Technology


Copyright © William El Kaim 2016 235
Hadoop V1 Issues
• Availability
• Hadoop 1.0 architecture had a single point of failure, the Job Tracker: if the Job Tracker fails,
all running jobs have to restart.
• Scalability
• The Job Tracker runs on a single machine performing various tasks such as Monitoring, Job
Scheduling, Task Scheduling and Resource Management.
• In spite of the presence of several machines (Data Nodes), they were not being utilized in an
efficient manner, thereby limiting the scalability of the system.
• Multi-Tenancy
• The major issue with Hadoop MapReduce that paved the way for the advent of Hadoop YARN was
multi-tenancy. With the increase in the size of clusters in Hadoop systems, the clusters can be
employed for a wide range of models.
• Cascading Failure
• With Hadoop MapReduce, when the number of nodes in a cluster grows beyond roughly 4,000,
instability is observed.
• The most common failure observed is a cascading failure, where overloaded nodes and data
replication flooding the network cause the overall cluster to deteriorate.

Copyright © William El Kaim 2016 Source: Dezyre 236


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop V2
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 237
Hadoop V2
• Hadoop (Hadoop 1.0) has progressed from a more restricted processing
model of batch oriented MapReduce jobs to developing specialized and
interactive processing models (Hadoop 2.0).
• Hadoop 2.0 popularly known as YARN (Yet another Resource Negotiator) is
the latest technology introduced in Oct 2013

Source: HortonWorks
Copyright © William El Kaim 2016 238
Hadoop V2
• Apache™ Tez generalizes the MapReduce paradigm to a more powerful
framework for executing a complex DAG (directed acyclic graph) of tasks.
• By eliminating unnecessary tasks, synchronization barriers, and reads from and writes to
HDFS, Tez speeds up data processing across both small-scale, low-latency and large-
scale, high-throughput workloads.
• Apache™ Slider is an engine that runs other applications in a YARN
environment.
• With Slider, distributed applications that aren’t YARN-aware can now participate in
the YARN ecosystem – usually with no code modification.
• Slider allows applications to use Hadoop’s data and processing resources, as well as the
security, governance, and operations capabilities of enterprise Hadoop.

Copyright © William El Kaim 2016 239


Hadoop V2: YARN
• YARN (Yet Another Resource Negotiator) is
• the foundation for parallel processing in Hadoop.
• Scalable to 10,000+ data node systems.
• Supports different types of workloads such as batch, real-time queries (Tez), streaming,
graphing data, in-memory processing, messaging systems, streaming video, etc. You
can think of YARN as a highly scalable and parallel processing operating system that
supports all kinds of different types of workloads.
• Supports batch processing providing high throughput performing sequential read scans.
• Supports real time interactive queries with low latency and random reads.

Copyright © William El Kaim 2016 240


Hadoop V2: Full Stack

Copyright © William El Kaim 2016 Source: HortonWorks 241


Hadoop V2 (Another Stack Vision)

Copyright © William El Kaim 2016 242


Hadoop V2: Spark Advantages
• Spark replaces MapReduce.
• MapReduce is inefficient at handling iterative algorithms as well as interactive data
mining tools.
• Spark is fast: uses memory differently and efficiently
• Run programs up to 100x faster than MapReduce in memory, or 10x faster on disk
• Spark excels at programming models
• involving iterations, interactivity (including streaming) and more.
• Spark offers over 80 high-level operators that make it easy to build parallel apps
• Spark runs Everywhere
• Runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources
including HDFS, Cassandra, HBase, S3.

Copyright © William El Kaim 2016 243


Hadoop V2: Spark Revolution

Spark powers a stack of high-level tools including Spark


SQL, MLlib for machine learning, GraphX, and Spark
Streaming. You can combine these libraries seamlessly in
the same application.

Copyright © William El Kaim 2016 244


Hadoop V2: Spark Stack Evolutions (2015)

• Goal: unified engine across data sources, workloads and environments
• DataFrame is a distributed collection of data organized into named columns
• ML pipeline to define a sequence of data pre-processing, feature extraction, model fitting,
and validation stages
Source: Databricks

Copyright © William El Kaim 2016 245


Hadoop V2: Upcoming Spark V2
• Spark programming revolved around the concept of a resilient distributed
dataset (RDD), which is a fault-tolerant collection of elements that can be
operated on in parallel.
• So the original Spark core API did not always feel natural for the larger population of data
analysts and data engineers, who worked mainly with SQL and statistical languages such as R.
• Today, Spark provides higher level APIs for advanced analytics and data science,
and supports five different languages, including SQL, Scala, Java, Python, R.
• What makes Spark quite special in the distributed computing arena is the fact that different
techniques such as SQL queries and machine learning can be mixed and combined together,
even within the same script.
• By using Spark, data scientists and engineers do not have to switch to different
environments and tools for data pre-processing, SQL queries or machine
learning algorithms. This fact boosts the productivity of data professionals and
delivers better and simpler data processing solutions.

Copyright © William El Kaim 2016 Source: Databricks 246


Spark V2: What’s New?
• Apache Spark Datasets: a high-level table-like data abstraction.
• Datasets feel more natural when reasoning about analytics and machine learning tasks,
and can be addressed both via SQL queries as well as programmatically via APIs.
• Programming APIs.
• Machine Learning has a big emphasis in this new release. spark.mllib package is
deprecated in favor of the new spark.ml package that focuses on pipeline based APIs
and is based on DataFrames.
• Machine Learning pipelines and models can now be persisted across all languages
supported by Spark.
• DataFrames and Datasets are now unified for Scala and Java programming languages
under the new Datasets class, which also serves as an abstraction for structured
streaming
• The new Structured Streaming API aims to allow managing streaming data sets without
added complexity.
• Performance has also improved with the second generation Tungsten engine, allowing
for up to 10 times faster execution.

Copyright © William El Kaim 2016 Source: Databricks 247
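A short PySpark sketch in the spirit of these points, mixing a DataFrame, a SQL query, and a spark.ml pipeline in a single script (the toy data and column names are invented):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark2-demo").getOrCreate()

df = spark.createDataFrame(
    [("spark is fast", 1.0), ("hadoop batch job", 0.0), ("spark streaming rocks", 1.0)],
    ["text", "label"],
)

# SQL and the DataFrame API can be mixed freely in the same script.
df.createOrReplaceTempView("docs")
spark.sql("SELECT label, COUNT(*) AS n FROM docs GROUP BY label").show()

# A spark.ml Pipeline chains pre-processing, feature extraction and model fitting.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(df)
model.transform(df).select("text", "prediction").show()

spark.stop()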


Hadoop V1 vs. V2

Source: HortonWorks
Video
Copyright © William El Kaim 2016 248
Hadoop V1 vs. V2
• YARN has taken over the cluster management responsibilities from
MapReduce
• Now MapReduce just takes care of data processing, while the other responsibilities are
handled by YARN.

Copyright © William El Kaim 2016 249


Hadoop V1 vs. V2: Map Reduce vs. Tez vs.
Spark

Source: Slim Baltagi


Copyright © William El Kaim 2016 250
Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop V3
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 251
Hadoop V3
• The Apache Hadoop project recently announced its 3.0.0-alpha1
release.
• Given the scope of a new major release, the Apache Hadoop community
decided to release a series of alpha and beta releases leading up to 3.0.0
GA.
• This gives downstream applications and end users an opportunity to test and provide
feedback on the changes, which can be incorporated during the alpha and beta process.
• The 3.0.0-alpha1 release incorporates thousands of new fixes,
improvements, and features since the previous minor release, 2.7.0, which
was released over a year ago.
• The full changelog and release notes are available on the Hadoop website, but we’d like
to drill into the major new changes that landed in 3.0.0-alpha1.

Copyright © William El Kaim 2016 252


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Additional Services
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 253
Hadoop Performance Benchmark
• TPCx-BB is an Express Benchmark to measure the
performance of Hadoop based Big Data systems.
• It measures the performance of both hardware and software
components by executing 30 frequently performed analytical
queries in the context of retailers with physical and online store
presence.
• The queries are expressed in SQL for structured data and in
machine learning algorithms for semi-structured and unstructured
data.
• The SQL queries can use Hive or Spark, while the machine
learning algorithms use machine learning libraries, user defined
functions, and procedural programs.
• The latest TPCx-BB Specification / The benchmark kit
• TPC-DS queries which are largely based on SQL:2003
specification are now supported in Spark 2.0
Copyright © William El Kaim 2016 254
Developing Applications on Hadoop

• Dedicated Application
Stack for Hadoop
• Casc
• Cascading
• Crunch
• Hfactory
• Hunk
• Spring for Hadoop

Copyright © William El Kaim 2016 255


Developing Applications on Hadoop
Example: Casc

Copyright © William El Kaim 2016 256


Hadoop Security
• Apache Ranger is a framework to enable, monitor and manage
comprehensive data security across the Hadoop platform.
• With the advent of Apache YARN, the Hadoop platform can now support a true data lake
architecture. Enterprises can potentially run multiple workloads, in a multi tenant
environment.
• Apache Metron provides a scalable advanced security analytics framework
built with the Hadoop community, evolving from the Cisco OpenSOC project.
• A cyber security application framework that provides organizations the ability to detect
cyber anomalies and enable organizations to rapidly respond to identified anomalies.
• Apache Sentry is a system to enforce fine grained role based authorization
to data and metadata stored on a Hadoop cluster.

Copyright © William El Kaim 2016 257


Hadoop Security
• Apache Eagle: Analyze Big Data
Platforms For Security and Performance
• Apache Eagle is an Open Source Monitoring
Platform for Hadoop ecosystem, which
started with monitoring data activities in
Hadoop.
• It can instantly identify access to sensitive
data, recognize attacks/malicious activities
and blocks access in real time.
• In conjunction with components (such as
Ranger, Sentry, Knox, DgSecure and Splunk
etc.), Eagle provides comprehensive solution
to secure sensitive data stored in Hadoop.
• As of 0.3.0, Eagle stores metadata and
statistics in HBase, and supports Druid
as a metric store.

Copyright © William El Kaim 2016 258


Hadoop Governance
Data Governance Initiative
• Enterprises adopting modern data architecture with Hadoop must reconcile
data management realities when they bring existing and new data from
disparate platforms under management.
• As customers deploy Hadoop into corporate data and processing
environments, metadata and data governance must be vital parts of any
enterprise-ready data lake.
• Data Governance Initiative (DGI)
• with Aetna, Merck, Target, and SAS
• Introduce a common approach to Hadoop data governance into the open source
community.
• Shared framework to shed light on how users access data within Hadoop while
interoperating with and extending existing third-party data governance and management
tools.
• A new project proposed to the apache software foundation: Apache Atlas
Copyright © William El Kaim 2016 259
Hadoop Governance
Data Governance Initiative

Copyright © William El Kaim 2016 260


Hadoop Governance
Apache Atlas and Apache Falcon
• Apache Atlas is a scalable and extensible set of core foundational
governance services
• It enables enterprises to effectively and efficiently meet their compliance requirements
within Hadoop and allows integration with the whole enterprise data ecosystem.
• Apache Falcon is a framework for managing data life cycle in Hadoop
clusters
• addresses enterprise challenges related to Hadoop data replication, business continuity,
and lineage tracing by deploying a framework for data management and processing.
• Falcon centrally manages the data lifecycle, facilitates quick data replication for business
continuity and disaster recovery and provides a foundation for audit and compliance by
tracking entity lineage and collection of audit logs.

Copyright © William El Kaim 2016 261


Hadoop Governance
Apache Atlas Capabilities

Copyright © William El Kaim 2016 Source: Apache Atlas 262


Hadoop Governance
Other Vendors Entering The Market
• Alation
• Cloudera Navigator
• Collibra
• Informatica Big Data Management
• Podium Data
• Zaloni

Copyright © William El Kaim 2016 263


Hadoop Governance
Cloudera Navigator

Copyright © William El Kaim 2016 https://www.cloudera.com/products/cloudera-navigator.html 264


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Market
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 265
Big Data Market
• The global big data market will grow from $18.3 billion in 2014 to a whopping
$92.2 billion by 2026 (14.4% annual growth rate).
• 2015 “a breakthrough year for big data”, with the market growing by 23.5
percent, led mainly by Hadoop platform revenues.
• Explosive growth of Hortonworks Inc. and other Hadoop vendors, as well as the rapid
adoption of Apache Spark and other streaming technologies.
• This growth in big data is being fueled by a desire among larger enterprises to become
more data-driven, as well as the emergence of new, Web-based, cloud-native startups
like AirBnB Inc., Netflix Inc. and Uber Technologies Inc., which were conceived with big
data at their core.
• 2016 – 2026 Worldwide big data Market Forecast by Wikibon.

Copyright © William El Kaim 2016 266


Big Data Market
• Wikibon breaks down global big data market revenues into three segments:
professional services (40% of all revenues in 2015), hardware (31%) and
software (29%).
• Wikibon’s projection for 2026 shows a markedly different split, with rapid growth in big
data-related software set to ensure that that segment overtakes the other two to account
for 46% of all big data spending in the next ten years, with professional services at 29%
and hardware at 25%.
• This shift will occur due to the development of better quality software that reduces the need for
big data-related services.
• 2016 – 2026 Worldwide big data Market Forecast by Wikibon.

Copyright © William El Kaim 2016 267


Big Data in Public Cloud Market
• Worldwide Big Data revenue in the public cloud
• was $1.1B in 2015 and will grow to $21.8B by 2026
• Will grow from 5% of all Big Data revenue in 2015 to 24% of all Big Data spending by
2026.
• However, the report highlights ongoing regulatory concerns as well as the
structural impediment of moving large amounts of data offsite as inhibitors to
mass adoption of Big Data deployments in the public cloud.
• “Big Data in the Public Cloud Forecast, 2016-2026” by Wikibon

Copyright © William El Kaim 2016 268


Big Data Market: 2014-2016 ($B)

Copyright © William El Kaim 2016 269


Hadoop Market
• According to Wikibon’s latest market analysis
• spending on Hadoop software and subscriptions accounted for a mere $187 million in
2014
• less than 1 percent of $27.4 billion in overall big data spending.
• Hadoop spending on software and subscriptions is expected to grow to $677 million by 2017, when the overall big data market will have grown to $50 billion (still only about 1 percent; including professional services, the share more than doubles to about 3 percent).
• Source
• Wikibon’s Big Data Vendor Revenue and Market Forecast 2011-2020 report.

Copyright © William El Kaim 2016 270


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Tools landscape
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 271
Big Data Landscape
Hadoop Distributions and Providers
• Three Main pure-play Hadoop distributors
• Cloudera, Hortonworks, and MapR Technologies
• Other Hadoop distributors
• SyncFusion: Hadoop for Windows,
• Pivotal Big Data Suite
• Pachyderm
• Hadoop Cloud Provider:
• Altiscale, Amazon EMR, BigStep, Google Cloud DataProc, HortonWorks SequenceIQ,
IBM BigInsights, Microsoft HDInsight, Oracle Big Data, Qubole, Rackspace
• Hadoop Infrastructure as a Service
• BlueData, Packet

Copyright © William El Kaim 2016 272


Big Data Landscape
Hadoop Distributions and Providers

Forrester Big Data Hadoop Distributions, Q1 2016 Forrester Big Data Hadoop Cloud, Q1 2016

Copyright © William El Kaim 2016 273


Big Data Landscape
Hadoop Distributions To Start With
• Apache Hadoop
• Cloudera Live
• Dataiku
• Hortonworks Sandbox
• IBM BigInsights
• MapR Sandbox
• Microsoft Azure HDInsight
• Syncfusion Hadoop for Windows
• W3C Big Data Europe Platform

Copyright © William El Kaim 2016 274


Source: Matt Turck

Copyright © William El Kaim 2016 275


Big Data Landscape
Cloud Provisioning: HortonWorks CloudBreak
• CloudBreak is a tool for provisioning Hadoop clusters on public cloud
infrastructure and for optimizing the use of cloud resources with elastic
scaling
• Part of the HortonWorks Data Platform and powered by Apache Ambari, CloudBreak
allows enterprises to simplify the provisioning of clusters in the cloud.

Copyright © William El Kaim 2016


Source: HortonWorks 276
Big Data Landscape
Pure-play Hadoop Distributors: SyncFusion

Copyright © William El Kaim 2016


https://www.syncfusion.com/products/big-data 277
Big Data Landscape
Pure-play Hadoop Distributors: Pivotal Big Data Suite

Copyright © William El Kaim 2016


http://pivotal.io/big-data/pivotal-big-data-suite 278
Big Data Landscape
Microsoft Azure HDInsight

Copyright © William El Kaim 2016 https://azure.microsoft.com/en-us/services/hdinsight/ 279


Big Data Landscape
SQL Server 2016 Polybase
• PolyBase is a technology that accesses
and combines both non-relational and
relational data, all from within SQL
Server.
• Allows to run queries on external data in Hadoop
or Azure blob storage.
• The queries are optimized to push computation to
Hadoop
• By simply using Transact-SQL (T-SQL) statements, you can import and export data back and forth between relational tables in SQL Server and non-relational data stored in Hadoop or Azure Blob Storage.
• You can also query the external data from within a T-SQL query and join it with relational data (see the sketch below).
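To make the query pattern concrete, here is a minimal sketch that runs such a T-SQL join from Python via pyodbc. The connection string, table and column names are invented for illustration, and dbo.WebClickstream stands in for a hypothetical PolyBase external table over Hadoop or Azure Blob Storage.

```python
# Hypothetical sketch: join a PolyBase external table (Hadoop-backed) with a
# regular SQL Server table. All names below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 13 for SQL Server};"
    "SERVER=myserver;DATABASE=Sales;UID=user;PWD=secret"
)

query = """
SELECT c.CustomerName, SUM(w.Amount) AS WebSpend
FROM dbo.Customers AS c            -- relational table in SQL Server
JOIN dbo.WebClickstream AS w       -- external table defined over Hadoop / Azure Blob
  ON c.CustomerId = w.CustomerId
GROUP BY c.CustomerName;
"""

for row in conn.cursor().execute(query):
    print(row.CustomerName, row.WebSpend)
```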

Copyright © William El Kaim 2016


https://msdn.microsoft.com/en-us/library/mt163689.aspx 280
Big Data Landscape
Pachyderm: Hadoop Alternative Container Based
• San Francisco-based company founded in 2014
• raised $2 million from Data Collective, Blumberg Capital,
Foundation Capital, and others.
• The Pachyderm stack uses Docker containers as well as CoreOS and
Kubernetes for cluster management.
• In Hadoop, people write their jobs in Java, and it all runs on the
JVM.
• It replaces
• MapReduce with Pachyderm Pipelines.
• You create a containerized program with the tools of your choice that
reads and writes to the local filesystem.
• HDFS with its Pachyderm File System
• Distributed file system (inspired by git), providing version control over all the data.
• Data is stored in generic object storage such as Amazon’s S3, Google
Cloud Storage or the Ceph file system.
• Provides historical snapshots of how your data looked at different points in time.

Copyright © William El Kaim 2016 http://www.pachyderm.io/ 281


Big Data Landscape
PNDA: big data analytics platform for networks and services.

Copyright © William El Kaim 2016 http://pnda.io/ 282


Big Data Landscape
BlueData: Hadoop As a Service Container Based
• BlueData will offer a Big-
Data-as-a-Service
(BDaaS) software platform
• can deliver any Big Data
distribution and application
on any infrastructure,
whether on-premises or in
the public cloud.
• Use Docker containers
(secure, embedded, and
fully managed) to be
agnostic about the
infrastructure – whether
physical server, virtual
machine, and now cloud at
scale

Copyright © William El Kaim 2016 Source: BlueData 283


Big Data Landscape
Hadoop Infrastructure as a Service: BlueData

Copyright © William El Kaim 2016


Source: BlueData 284
Big Data Landscape
Big Data As a Service: Qubole

Copyright © William El Kaim 2016 http://www.qubole.com/ 285


Big Data Landscape
Big Data As a Service: BigStep
Real Time

Batch

Copyright © William El Kaim 2016 http://bigstep.com/solutions/architectures 286


Big Data Landscape
Cloud Provisioning: Apache Ambari + Apache Brooklyn

Apache Brooklyn is
an application
blueprinting and
management system
which supports a
wide range of
software and services
in the cloud.

Copyright © William El Kaim 2016 Source: TheNewStack 287


Big Data Landscape
Collecting and querying Hadoop Metrics

Copyright © William El Kaim 2016 Source: HortonWorks 288


Big Data Landscape
Big Data application performance monitoring (APM)

Copyright © William El Kaim 2016 Source: Driven 289


Big Data Landscape
Ingestion Technologies: Apache NiFi

Provides scalable
directed graphs of
data routing,
transformation,
and system
mediation logic

Copyright © William El Kaim 2016


https://nifi.apache.org/ 290
Big Data Landscape
Data Wrangling

Paxata Trifacta

Copyright © William El Kaim 2016 291


Big Data Landscape
Data Preparation

Source: Bloor Research

Copyright © William El Kaim 2016 292


Big Data Landscape
Open Source Hadoop RDBMS: Splice Machine

Copyright © William El Kaim 2016 http://www.splicemachine.com/ 293


Big Data Landscape
Data integration On Demand: Xplenty

Copyright © William El Kaim 2016 https://www.xplenty.com/#features 294


Big Data Landscape
Data integration On Demand: StreamSets Data Collector

https://streamsets.com/ 295
Copyright © William El Kaim 2016
Big Data Landscape
Data integration On Demand: SnapLogic

Copyright © William El Kaim 2016 https://www.snaplogic.com/ 296


Big Data Landscape
Enterprise Integration: Informatica Vibe
Informatica Vibe allows users to create data-integration mappings once, and then run
them across multiple platforms.

Copyright © William El Kaim 2016 297


Big Data Landscape
Enterprise Integration: Tibco ActiveMatrix BusinessWorks

TIBCO
ActiveMatrix
BusinessWorks 6
+ Apache Hadoop
= Big Data
Integration

Copyright © William El Kaim 2016 Source: Tibco 298


Big Data Landscape
IBM DataWorks
• Available on Bluemix, IBM’s Cloud platform, DataWorks integrates and leverages Apache Spark, IBM Watson Analytics, and the IBM Data Science Experience.
• It is designed to help organizations:
• Automate the deployment of data assets and products using cognitive-based machine
learning and Apache Spark;
• Ingest data faster than any other data platform, from 50 to hundreds of Gbps, and all
endpoints: enterprise databases, Internet of Things, weather, and social media;
• Leverage an open ecosystem of more than 20 partners and technologies, such as
Confluent, Continuum Analytics, Galvanize, Alation, NumFOCUS, RStudio, Skymind,
and more.
• Additionally, DataWorks is underpinned by core cognitive capabilities, such as cognitive-
based machine learning. This helps speed up the process from data discovery to model
deployment, and helps users uncover new insights that were previously hidden from them.

Copyright © William El Kaim 2016 Source: IBM Dataworks 299


Big Data Landscape
IBM DataWorks

Copyright © William El Kaim 2016 Source: IBM Dataworks 300


Big Data Landscape
Data Science: Anaconda Platform

Copyright © William El Kaim 2016 https://www.continuum.io/ 301


Big Data Landscape
Data Science: Dataiku DSS

Combine and Join Datasets

Create Machine Learning Models


http://www.dataiku.com/dss/ 302
Copyright © William El Kaim 2016
Big Data Landscape
Data Science: Datameer

https://www.datameer.com/ 303
Copyright © William El Kaim 2016
Big Data Landscape
Data Science: IBM Data Science Experience
• Data Science Experience is a cloud-
based development environment for
near real-time, high performance
analytics
• Available on IBM Cloud Bluemix platform
• Provides
• 250 curated data sets
• open source tools and a collaborative
workspace like H2O, RStudio, Jupyter
Notebooks on Apache Spark
• in a single security-rich managed
environment.
• Helps data scientists uncover and share
meaningful insights with developers,
making it easier to rapidly develop
applications that are infused with
intelligence.

http://datascience.ibm.com/ 304
Copyright © William El Kaim 2016
Big Data Landscape
Data Science: Tamr

http://www.tamr.com/
Copyright © William El Kaim 2016 305
Big Data Landscape
Machine Learning as A Service
• Open Source (see the scikit-learn sketch after this list)
• Accord (Dotnet), Apache Mahout, Apache Samoa, Apache Spark MLlib and MLbase, Apache SystemML, Cloudera Oryx, GoLearn (Go), H2O, Photon ML, Prediction.io, R Hadoop, Scikit-learn (Python), Seldon, Shogun (C++), Google TensorFlow, Weka.
• Available as a Service
• Algorithmia, Algorithms.io, Amazon ML, BigML, DataRobot, FICO, Google Prediction
API, HPE Haven OnDemand, IBM’s Watson Analytics, Microsoft Machine Learning
Studio, PurePredictive, Predicsis, Yottamine.
• Examples
• BVA with Microsoft Azure ML
• Quick Review of Amazon Machine Learning
• BigML training Series
• Handling Large Data Sets with Weka: A Look at Hadoop and Predictive Models
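As a concrete illustration of the open-source options above, here is a minimal scikit-learn sketch; the dataset and model choice are arbitrary and purely illustrative.

```python
# Minimal scikit-learn example: train and evaluate a classifier on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```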

Copyright © William El Kaim 2016 306


Scalable Data Science with R
• Hadoop: Analyze data with Hadoop through R code (Rhadoop)
• rhdfs to interact with HDFS systems;
• rhbase to connect with Hbase;
• plyrmr to perform common data transformation operations over large datasets;
• rmr2 that provides a map-reduce API;
• and ravro that writes and reads avro files.
• Spark: with SparkR
• It is possible to use Spark’s distributed computation engine to enable large-scale data
analysis from the R shell. It provides a distributed data frame implementation that
supports operations like selection, filtering, aggregation, etc., on large data sets.
• Programming with Big Data in R
• The "Programming with Big Data in R" project (pbdR) is based on MPI and can be used on high-performance computing (HPC) systems, providing a true parallel programming environment in R.
Source: Federico Castanedo
Copyright © William El Kaim 2016 307
Scalable Data Science with R
• After the data preparation step, the next common data science phase
consists of training machine learning models, which can also be performed
on a single machine or distributed among different machines.
• In the case of distributed machine learning frameworks, the most popular
approaches using R, are the following:
• Spark MLlib: through SparkR, some of the machine learning functionalities of Spark are
exported in the R package.
• H2O framework: a Java-based framework that allows building scalable machine learning models in R or Python (a minimal Python sketch follows this list).
• Apache MADlib (incubating): Big Data Machine Learning in SQL
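To illustrate the H2O route from Python, here is a minimal sketch; the file path, column names and model choice are placeholders rather than a recommended setup.

```python
# Minimal, hypothetical H2O sketch: train a binomial GLM on a CSV file.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()  # starts (or connects to) a local H2O cluster
frame = h2o.import_file("churn.csv")            # placeholder path
frame["churned"] = frame["churned"].asfactor()  # placeholder binary target column

predictors = [c for c in frame.columns if c != "churned"]
model = H2OGeneralizedLinearEstimator(family="binomial")
model.train(x=predictors, y="churned", training_frame=frame)
print(model.auc())
```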

Copyright © William El Kaim 2016 308


Big Data Landscape
Business Intelligence and Analytics Platforms

Copyright © William El Kaim 2016 309


Big Data Landscape
Business Intelligence and Analytics Platforms
• Tableau, Qlikview and Jethro (SQL Acceleration Engine for BI on Big Data
compatible with BI tools like Tableau and Qlik).
• Alteryx, Birst, Datawatch, Domo, GoodData, Looker, PyramidAnalytics,
Saagie and ZoomData are increasingly encroaching on the territory once
claimed by Qlik and Tableau.
• At the same time, a new crop of Hadoop and Spark data based BI tools from
the likes of Platfora, Datameer, and Clearstory Data appeared on the market.
• And the old guard is still there: SAP Lumira, Microsoft Power BI, SAS Visual Analytics
• And open source tools like Datawrapper

Copyright © William El Kaim 2016 310


Big Data Landscape
Business Intelligence and Analytics Platforms: Saagie

Copyright © William El Kaim 2016


https://www.saagie.com/products 311
Big Data Landscape
Hadoop for Data Analytics and Use: Apache Zeppelin
Apache Zeppelin is a web-based notebook that enables
interactive data analytics. You can make beautiful data-
driven, interactive and collaborative documents with SQL,
Scala and more.

http://zeppelin.incubator.apache.org

Copyright © William El Kaim 2016 312


Big Data Landscape
Dynamic Data Warehouse

Copyright © William El Kaim 2016 http://www.infoworks.io/ 313


Big Data Landscape
Data Visualization Software
Visual analytics is the act
of finding meaning in
data using visual
artifacts such as charts,
graphs, maps and
dashboards. In addition,
the user interface is
typically driven by drag
and drop actions using
wholly visual constructs.

Copyright © William El Kaim 2016 314


Big Data Landscape
Data Visualization Software
• Four dominant modes of analysis: descriptive (traditional BI), discovery
(looking for unknown facts), predictive (finding consistent patterns that can
be used in future activities), and prescriptive (actions that can be taken to
improve performance).
• BeyondCore, BIME, ClearStory,
DOMO, GoodData, Inetsoft, InfoCaptor,
Logi Analytics, Looker, Microsoft Power
BI, Microstrategy, Prognoz, Qlik Sense,
SAP Lumira, SAS Visual Analytics,
Sisense, Spotfire, Tableau, ThoughtSpot,
Yellowfin.
Source: ButlerAnalytics

Copyright © William El Kaim 2016 315
Big Data Landscape
Dataviz Tools
• For Non Developers
• ChartBlocks, Infogram, Plotly, Raw, Visual.ly (see the Plotly sketch after this list)
• For Developers
• D3.js, Infovis, Leaflet, NVD3, Processing.js, Recline.js, visualize.js
• Chart.js, Chartist.js, Ember Charts , Google Charts, FusionCharts, Highcharts, n3-
charts, Sigma JS, Polymaps
• More
• Datavisualization.ch curated list
• ProfitBricks list
• Dedicated libraries are also available for Python, Java, C#, Scala, etc.
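For a quick taste of using one of the libraries above from code, here is a minimal Plotly (Python) sketch; the values plotted are invented for illustration.

```python
# Minimal Plotly example: render a bar chart to a standalone HTML file.
import plotly.graph_objs as go
from plotly.offline import plot

data = [go.Bar(x=["2014", "2015", "2016"], y=[10, 14, 21], name="Illustrative values")]
layout = go.Layout(title="Example bar chart", yaxis=dict(title="Value"))

plot(go.Figure(data=data, layout=layout), filename="example_chart.html")
```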

Copyright © William El Kaim 2016 316


Big Data Landscape
Other Interesting Tools
• Storage
• Druid is an open-source analytics data store designed for OLAP queries on time series
data (trillions of events, petabytes of data).
• OpenTSDB (HBase) and Kairos: Time-series databases built on top of open-source
nosql data stores.
• Aerospike, VoltDB: Database software for handling large amounts of real-time event
data.
• Services
• SyncSort Hadoop ETL Solution extends the capabilities of Hadoop.
• Snowplow is an Event Analytics Platform.
• IT monitoring: Graphistry, Splunk, SumoLogic, ScalingData, and CloudPhysics.
• Modern monitoring platform using streaming analytics: Anodot, Graphite, DR-Elephant,
and SignalFx

Copyright © William El Kaim 2016 317


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Big Data Ecosystem For Science
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 318
Big Data Ecosystem For Science
• Large-scale data management is essential for experimental science and has
been for many years. Telescopes, particle accelerators and detectors, and
gene sequencers, for example, generate hundreds of petabytes of data that
must be processed to extract secrets and patterns in life and in the universe.
• The data technologies used in these various science communities often
predate those in the rapidly growing industry big data world, and, in many
cases, continue to develop independently, occupying a parallel big data
ecosystem for science, supported by the National Energy Research
Scientific Computing Center (NERSC).
• Across these projects we see a common theme: data volumes are growing,
and there is an increasing need for tools that can effectively store and
process data at such a scale.
• In some cases, the projects could benefit from big data technologies being developed in
industry, and in some other projects, the research itself will lead to new capabilities.

Copyright © William El Kaim 2016 Source: Wahid Bhimji on O’Reilly 319


Big Data Ecosystem For Science

Copyright © William El Kaim 2016 Source: Wahid Bhimji on O’Reilly 320


Big Data Ecosystem For Science
Other Interesting Services
• Data Format
• ROOT offers a self-describing binary file format with huge flexibility for serialization of
complex objects and column-wise data access.
• HDF5: a format that enables more efficient processing of simulation output thanks to its parallel input/output (I/O) capabilities (a minimal h5py sketch follows this list)
• Data Federation
• XrootD data access protocol, which allows all of the data to be accessed in a single global namespace and served up in a mechanism that is both fault-tolerant and high-performance.
• Data Management
• Big PanDA runs analyses that allow thousands of collaborators to run hundreds of thousands of processing steps on exabytes of data, as well as monitor and catalog that activity.
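As a small illustration of writing analysis-friendly HDF5 from Python, here is a minimal h5py sketch; the dataset name and shape are invented, and true parallel I/O additionally requires an MPI-enabled HDF5/h5py build.

```python
# Minimal h5py sketch: write simulation-style output to HDF5 and read it back.
import numpy as np
import h5py

with h5py.File("simulation_output.h5", "w") as f:
    f.create_dataset("temperature", data=np.random.rand(1000, 1000), compression="gzip")

with h5py.File("simulation_output.h5", "r") as f:
    print(f["temperature"].shape, f["temperature"][0, :5])
```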

Copyright © William El Kaim 2016 321


Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop Architecture Examples
Copyright © William El Kaim 2016 322
Hadoop Architecture
[Diagram: reference data pipeline spanning the Data Sourcing, Data Preparation and Feature Preparation phases, organized as Data Sources → Data Ingestion → Data Lake → Lakeshore → App. Services]
• Data Sources: Open Data; Operational Systems (ODS, IoT); Existing Sources of Data (Databases, DW, DataMart)
• Data Ingestion: Batch and Streaming
• Data Lake: Hadoop and Spark, holding Metadata, Source Data and Computed Data
• Lakeshore: SQL & NoSQL Databases; Data Science Tools & Platforms (Dataiku, Tableau, Python, R, etc.); BI Tools & Platforms (Qlik, Tibco, IBM, SAP, BIME, etc.); application frameworks (Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop)
• App. Services: Data Driven Business Processes, Applications and Services

Copyright © William El Kaim 2016 323


Hadoop Technologies
[Diagram: technology options at each stage of the pipeline, from Data Sources → Data Ingestion → Data Lake → Lakeshore & Analytics → Analytics App and Services]
• Data Sources: Open Data; Operational Systems (ODS, IoT); Existing Sources of Data (Databases, DW, DataMart)
• Data Ingestion: Encoding formats (JSON, RCFile, Parquet, ORCFile); Batch (MapReduce); Streaming (event stream & micro batch); Ingestion technologies (Apex, Flink, Flume, Kafka, Amazon Kinesis, NiFi, Samza, Spark, Sqoop, Scribe, Storm, NFS Gateway, etc.)
• Data Lake: Distributed file systems (GlusterFS, HDFS, Amazon S3, MapRFS, ElasticSearch); NoSQL databases (Cassandra, Ceph, DynamoDB, HBase, Hive, Impala, Ring, OpenStack Swift, etc.); Distributions (Cloudera, HortonWorks, MapR, SyncFusion, Amazon EMR, Azure HDInsight, Altiscale, Pachyderm, Qubole, etc.)
• Lakeshore & Analytics: Data science (Dataiku, Datameer, Tamr, R, SAS, Python, RapidMiner, etc.); Machine learning (BigML, Mahout, Predicsys, Azure ML, TensorFlow, H2O, etc.); BI tools & platforms (Qlik, Tableau, Tibco, Jethro, Looker, IBM, SAP, BIME, etc.); Data warehouse (Cassandra, Druid, DynamoDB, MongoDB, Redshift, Google BigQuery, etc.)
• Analytics App and Services: Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop, D3.js, Leaflet

Copyright © William El Kaim 2016 324


Intuit Example: Initial Cloud Platform

Copyright © William El Kaim 2016 325


Intuit Example: Initial Platform Concerns

Key data sources


1. Clickstream
2. Transactional user-entered data
3. Back office data and insights

Key cross-cutting concerns


4. Traceability – customer ID,
transaction ID
5. REACTive platform architecture
6. Analytics infrastructure
7. Model congruity
8. Sources of truth

Copyright © William El Kaim 2016 326


Intuit Example: Revised Platform

Copyright © William El Kaim 2016 327


Uber Initial Usage of Hadoop
• Uber relied on Kafka data feeds to bulk-load log data into Amazon S3, and
used EMR to process that data.
• It then moved the “gold-plated” output from EMR into its relational
warehouse, which is accessed by internal users and the city-level directors
leading Uber’s expansion around the world.
• The Celery/Python-based ETL system the company built to load the data
warehouse “worked pretty well,” but then Uber ran into scale issues
• As they added more cities, as the scale increased, Uber hit a bunch of problems in
existing systems, particularly around the batch-oriented upload of data.

Copyright © William El Kaim 2016 Source: Datanami 328


Uber Example: Initial Usage of Hadoop

Copyright © William El Kaim 2016 Source: Datanami 329


Uber Example: Architecture Issues
• The Celery/Python-based ETL system the company built to load the data
warehouse “worked pretty well,” but then Uber ran into scale issues
• As they added more cities, as the scale increased, Uber hit a bunch of problems in
existing systems, particularly around the batch-oriented upload of data.
• Uber needed to ensure that the “trips” data, which documents the hundreds
of thousands of actual car rides that Uber drivers give each day and is critical
for accurately paying drivers, was ready to be consumed by downstream
users and applications.
• The system wasn’t built for multiple data centers.

Copyright © William El Kaim 2016 Source: Datanami 330


Uber Example: New Hadoop Architecture
• The solution involved a new Spark-based system called streamIO that
replaced the Celery/Python ETL system.
• The new system essentially decouples the raw data ingest from the relational
data warehouse table model
• by pushing raw data onto HDFS and then relying on something like Spark, which can do very large-scale processing, to figure out the transformations later on.
• So instead of trying to aggregate the trip data from multiple distributed data
centers in a relational model, Uber’s new system
• uses Kafka to stream change-data logs from the local data centers, and loads them into
the centralized Hadoop cluster.
• uses Spark SQL to convert the schema-less JSON data into more structured Parquet files, which form the basis for SQL-powered analysis done using Hive (a minimal PySpark sketch of this step follows below).
• Learn more on youtube
• https://www.youtube.com/watch?v=zKbds9ZPjLE
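Below is a minimal PySpark sketch of that Spark SQL step; paths, column names and the filter are invented to illustrate the pattern and are not Uber's actual code.

```python
# Illustrative PySpark job: read schema-less JSON landed from Kafka onto HDFS,
# then write it back as partitioned Parquet for Hive/SQL analysis.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("trips-json-to-parquet")
         .enableHiveSupport()
         .getOrCreate())

trips = spark.read.json("hdfs:///raw/trips/2016/12/*")   # schema inferred from JSON

(trips.filter(trips.status == "completed")
      .write.mode("overwrite")
      .partitionBy("city_id")
      .parquet("hdfs:///warehouse/trips_parquet"))
```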
Copyright © William El Kaim 2016 Source: Datanami 331
Uber Example: Usage Of Hadoop

Copyright © William El Kaim 2016 Source: Datanami 332


StreamSets Example: Apache KUDU
Usage

Copyright © William El Kaim 2016 Source: StreamSets Blog 333


Snowplow Example: Unified Log Analytics
Before: Batch-based,
Normally run overnight,
Sometimes every 4-6 hours

Copyright © William El Kaim 2016 Source: Snowplow 334


Snowplow Example: Unified Log Archi. on AWS

Copyright © William El Kaim 2016 Source: Snowplow 335


Netflix Example: Streaming Architecture
[Diagram: evolution of the Netflix streaming data pipeline, showing the Before, Today and Future architectures]
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log (a minimal Python producer/consumer sketch follows).
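A minimal kafka-python sketch of that publish-subscribe pattern; the broker address, topic name and payload are placeholders.

```python
# Minimal kafka-python sketch: publish an event and read it back.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("view-events", b'{"title": "example-show", "ts": 1480000000}')
producer.flush()

consumer = KafkaConsumer("view-events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)  # stop iterating after 5s of silence
for message in consumer:
    print(message.offset, message.value)
```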

Copyright © William El Kaim 2016 Source: Netflix 336


Edmunds.com Near Real-Time Dashboard

Copyright © William El Kaim 2016


Source: Cloudera 337
EA Digital Codex: http://www.eacodex.com/
Twitter: http://www.twitter.com/welkaim
Linkedin: http://fr.linkedin.com/in/williamelkaim
SlideShare: http://www.slideshare.net/welkaim

Copyright © William El Kaim 2016 338
