William El Kaim
Dec. 2016 – V3.3
This Presentation is part of the
Enterprise Architecture Digital Codex
Visualization
Big Data: A collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications
Big Data: When the data does not fit in Excel. The limit used to be 65,536 rows; it is now 1,048,576.
Big Data: When it's cheaper to keep everything than spend the effort to decide what to throw
away (David Brower @dbrower)
Source: Capgemini
Doug Cutting extended Apache Nutch
1. Data Management and Storage
2. Big Data Analytics and Use
ETL Pre-processor
• Shift ETL pre-processing from the staging data warehouse to Hadoop
• Shifts high-cost data warehousing to lower-cost Hadoop clusters
Massive Storage
• Offloading large volume of historical data into cold storage with Hadoop
• Keep data warehouse for hot data to allow BI and analytics
• When data from cold storage is needed, it can be moved back into the warehouse
Data Discovery
• Keep data warehouse for operational BI and analytics
• Allow data scientists to gain new discoveries on raw data (no format or structure)
• Operationalize discoveries back into the warehouse
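The ETL pre-processor pattern above can be sketched in map/reduce style: cleanse raw staging records on the (cheap) Hadoop side, then load only the aggregated summary into the warehouse. This is an illustrative sketch, not code from the deck; the record layout and function names are assumptions.

```python
# Hypothetical sketch of the "ETL pre-processor" pattern: cleanse and
# aggregate on the Hadoop side before loading the warehouse.
from collections import defaultdict

def map_phase(raw_lines):
    """Map: parse and cleanse raw staging records, emit (key, value) pairs."""
    for line in raw_lines:
        fields = line.strip().split(",")
        if len(fields) != 3:          # drop malformed records during pre-processing
            continue
        customer, _date, amount = fields
        yield customer, float(amount)

def reduce_phase(pairs):
    """Reduce: aggregate per key; only this summary is loaded into the DW."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

raw = ["alice,2016-12-01,10.5", "bad_record", "alice,2016-12-02,4.5", "bob,2016-12-01,7.0"]
summary = reduce_phase(map_phase(raw))
# summary == {"alice": 15.0, "bob": 7.0}
```

The warehouse then stores only `summary`, while the raw records stay in cheap Hadoop storage.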
[Diagram: layered Big Data architecture]
1. Data Source Layer: sensors and devices, DB data, external data
2. Data Storage Layer
3. Data Processing / Analysis Layer
4. Business Intelligence & Analytics: dashboards, reports, visualization, …
Copyright © William El Kaim 2016 Source: The Modeling Agency and sv-europe 52
How to Implement Hadoop?
McKinsey Seven Steps Approach
Source: Datanami & Cloudera & Dremio
What is BI on Hadoop?
Three Options: SQL-on-Hadoop
• SQL on Hadoop tools could be categorized as
• Interactive or Native SQL
• Batch & Data-Science SQL
• OLAP Cubes (In-memory) on Hadoop
Source: Forrester
• Observations
• Items or entities used for learning or evaluation (e.g., emails)
• Features
• Attributes (typically numeric) used to represent an observation (e.g., length, date, presence of keywords)
• Labels
• Values / categories assigned to observations (e.g., spam, not-spam)
• Training and Test Data
• Observations used to train and evaluate a learning algorithm (e.g., a set of emails along with their labels)
• Training data is given to the algorithm for training, while test data is withheld at train time
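The terms above can be made concrete with a tiny, hypothetical example: email observations, spam / not-spam labels, and a train/test split where the test labels are withheld at training time. The data is invented for illustration.

```python
# Illustrative example of observations, labels, and a train/test split.
import random

observations = ["win money now", "meeting at noon", "cheap pills", "lunch tomorrow"]
labels       = ["spam",          "not-spam",        "spam",        "not-spam"]

pairs = list(zip(observations, labels))
random.seed(0)                      # deterministic shuffle for the example
random.shuffle(pairs)

split = int(0.75 * len(pairs))      # 75% training, 25% test
train, test = pairs[:split], pairs[split:]

train_data  = train                         # given to the algorithm with labels
test_data   = [obs for obs, _ in test]      # labels withheld at train time
test_labels = [lab for _, lab in test]      # kept aside for evaluation
```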
Source: Microsoft
Big Data: Azure Machine Learning
[Diagram: Hadoop ecosystem stack]
• Processing: OLAP (Impala, Hawq, MapReduce / Tez / Stinger), OLTP (HBase), Machine Learning (MapReduce / Tez, Spark, Giraph, Hama)
• Hadoop Distributed Storage: GlusterFS, HDFS, S3, MapR, Isilon, Cassandra, DynamoDB, Ceph, OpenStack Swift, Ring
Source: Databricks
Source: Ippon
Source: Apache
Copyright © William El Kaim 2016 146
Batch processing
• Large amount of static data
• Generally incurs high latency
• Volume
Real-time processing
• Compute streaming data
• Low latency
• Velocity
Hybrid computation
• Lambda Architecture
• Volume + Velocity
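The hybrid (Lambda) idea above can be sketched minimally: a batch layer recomputes an accurate view over all historical events (Volume), a speed layer keeps a low-latency view of recent events (Velocity), and queries merge both. Names and data here are hypothetical illustrations, not a real implementation.

```python
# Minimal sketch of the Lambda architecture idea (illustrative only).
historical_events = [("page_a", 1), ("page_b", 1), ("page_a", 1)]
recent_events     = [("page_a", 1)]

def batch_view(events):
    """Batch layer: high-latency, recomputed from the full data set (Volume)."""
    view = {}
    for key, count in events:
        view[key] = view.get(key, 0) + count
    return view

def speed_view(events):
    """Speed layer: low-latency, incremental over the stream (Velocity)."""
    return batch_view(events)   # same aggregation, but over recent data only

def query(key):
    """Serving layer: merge the batch and real-time views."""
    return batch_view(historical_events).get(key, 0) + speed_view(recent_events).get(key, 0)

# query("page_a") -> 3 (2 from the batch view + 1 from the speed layer)
```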
Source: Rubén Casado & Cloudera
Hadoop Processing Paradigms & Time
• Batch (Volume): scalable, distributed, parallel, fault-tolerant; processes large amounts of static data; high latency
• Real-Time (Velocity): distributed, parallel, fault-tolerant; processes continuous unbounded streams of data; low latency
• Hybrid (Volume + Velocity): combines batch and real-time processing
Kappa
Source: Kreps & Ericsson
Source: Forrester
• Enterprise architects whose companies are pursuing a big data strategy can
benefit from a big data fabric implementation that automates, secures,
integrates, and curates big data sources intelligently.
• Your big data fabric strategy should:
• Integrate only a few big data sources at first.
• Start top-down rather than bottom-up, keeping the end in mind.
• Separate analytics from data management. Analytics tools should focus primarily on
data visualization and advanced statistical/data mining algorithms with limited
dependence on data management functions. Decoupling data management from data
analytics reduces the time and effort needed to deliver trusted analytics.
• Create a team of experts to ensure success.
• Use automation and machine learning to accelerate deployment.
• ESRI
• ESRI for Big Data
• Esri GIS tools for Hadoop: toolkit allowing developers to build analytical tools leveraging both Hadoop and ArcGIS
• Esri User Defined Functions built on top of the Esri Geometry API
• Pigeon: spatial extension to Pig that allows it to process spatial data
• Hive Spatial Query: adds spatial geometric user-defined functions (UDFs) to Hive
• GeoMesa
• An open-source, distributed, spatio-temporal database built on Accumulo, HBase, Cassandra, and Kafka
• SpatialHadoop
• Open source MapReduce extension designed specifically to handle huge datasets of spatial data on Apache Hadoop
• SpatialHadoop is shipped with a built-in high-level language, spatial data types, spatial indexes and efficient spatial operations
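To give a feel for what a spatial UDF of the kind mentioned above computes, here is a standalone great-circle distance function using the haversine formula. The function name `st_distance_km` is invented for illustration and is not the Hive UDF's actual name.

```python
# Illustrative great-circle distance (haversine), the kind of geometric
# primitive a spatial UDF exposes. Name and usage are hypothetical.
import math

def st_distance_km(lat1, lon1, lat2, lon2):
    """Distance in km between two (lat, lon) points on a spherical Earth."""
    r = 6371.0                              # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Paris -> London is roughly 344 km
d = st_distance_km(48.8566, 2.3522, 51.5074, -0.1278)
```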
• GeoDataViz
• CartoDB: Deep Insights technology is capable of handling and visualizing massive amounts of contextual and time-based location data
• Spatialytics: standard geoBI platform
• mapD: leverages GPU and a dedicated NoSQL database for better performance
• deck.gl (Uber): WebGL-powered framework for visual exploratory data analysis of large datasets
• Data Converter
• ESRI GeoJSON Utils
• GDAL: Geospatial Data Abstraction Library
• Redis
• Open source (BSD licensed), in-memory data structure store, used as database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries
• Tutorial / Examples
• How To Analyze Geolocation Data with Hive and Hadoop – Uber trips
• Geospatial data support for Hive using Taxi data in NYC
• ESRI Wiki
Is There a Hadoop Standard?
Open Data Platform Initiative
• Objectives are:
• Reinforces the role of the Apache Software Foundation (ASF) in the
development and governance of upstream projects.
• Accelerates the delivery of Big Data solutions by providing a well-
defined core platform to target.
• Defines, integrates, tests, and certifies a standard "ODPi Core" of
compatible versions of select Big Data open source projects.
• Provides a stable base against which Big Data solution providers can
qualify solutions.
• Produces a set of tools and methods that enable members to create
and test differentiated offerings based on the ODPi Core.
• Contributes to ASF projects in accordance with ASF processes and
Intellectual Property guidelines.
• Supports community development and outreach activities that
accelerate the rollout of modern data architectures that leverage
Apache Hadoop®.
• Will help minimize the fragmentation and duplication of effort within
the industry.
Source: Odpi
Plan
• Taming The Data Deluge
• What is Big Data?
• Why Now?
• What is Hadoop?
• When to use Hadoop?
• How to Implement Hadoop?
• What is a Data Lake?
• What is BI on Hadoop?
• What is Big Data Analytics?
• Big Data Technologies
• Hadoop Distributions & Tools
• Hadoop V1
• Hadoop Architecture Examples
Hadoop V1: Integration Options
[Diagram: two integration options — Near Real-Time Integration and Batch & Scheduled Integration — each connecting logs & files, databases & warehouses, applications & spreadsheets, and visualization & intelligence tools to HDFS and MapReduce, with HCatalog as the metadata layer]
Source: HortonWorks
http://hadooper.blogspot.fr/
Hadoop V1: Technology Elements
• Hive - A data warehouse infrastructure that runs on top of Hadoop. Hive supports SQL queries, star schemas, partitioning, join optimizations, caching of data, etc.
• Pig - A scripting language for processing Hadoop data in parallel.
• MapReduce - Java applications that can process data in parallel.
• Ambari - An open source management interface for installing, monitoring and managing a Hadoop cluster. Ambari has also been selected as the management interface for OpenStack.
• HBase - A NoSQL columnar database providing extremely fast scanning of column data for analytics.
• Sqoop, Flume - Tools providing large-scale data ingestion for Hadoop using SQL, streaming and REST API interfaces.
• Oozie - A workflow manager and scheduler.
• Zookeeper - A coordination infrastructure.
• Mahout - A machine learning library supporting recommendation, clustering, classification and frequent itemset mining.
• Hue - A web interface that contains a file browser for HDFS, a job browser for YARN, an HBase browser, query editors for Hive, Pig and Sqoop, and a Zookeeper browser.
Source: HortonWorks
Hadoop V2
• Apache™ Tez generalizes the MapReduce paradigm to a more powerful
framework for executing a complex DAG (directed acyclic graph) of tasks.
• By eliminating unnecessary tasks, synchronization barriers, and reads from and writes to HDFS, Tez speeds up data processing across both small-scale, low-latency and large-scale, high-throughput workloads.
• Apache™ Slider is an engine that runs other applications in a YARN
environment.
• With Slider, distributed applications that aren’t YARN-aware can now participate in
the YARN ecosystem – usually with no code modification.
• Slider allows applications to use Hadoop’s data and processing resources, as well as the
security, governance, and operations capabilities of enterprise Hadoop.
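The core idea behind Tez — expressing a job as a directed acyclic graph of tasks and running each task once its dependencies finish, instead of chaining separate MapReduce jobs through HDFS writes — can be sketched with a plain topological sort. Task names and the DAG here are invented for illustration.

```python
# Illustrative sketch of DAG-based task scheduling (the idea behind Tez).
def topological_order(dag):
    """Kahn's algorithm: dag maps task -> list of prerequisite tasks."""
    remaining = {task: set(deps) for task, deps in dag.items()}
    order = []
    while remaining:
        # tasks whose dependencies have all completed can run now
        ready = sorted(t for t, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected")
        for task in ready:
            order.append(task)
            del remaining[task]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order

# hypothetical query plan: filter and join need scan; aggregate needs join
dag = {"scan": [], "filter": ["scan"], "join": ["scan", "filter"], "aggregate": ["join"]}
# topological_order(dag) -> ['scan', 'filter', 'join', 'aggregate']
```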
• DataFrame: a distributed collection of data organized into named columns
• ML Pipeline: defines a sequence of data pre-processing, feature extraction, model fitting, and validation stages
Source: Databricks
Source: HortonWorks
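The pipeline notion above — a fixed sequence of stages applied in order, in the spirit of Spark ML Pipelines — can be sketched in a few lines. The stage classes and data are hypothetical, not Spark's actual API.

```python
# Minimal sketch of the ML-pipeline idea: each stage transforms the data
# and feeds the next stage (stage names are illustrative).
class LowercaseStage:
    def transform(self, docs):
        return [d.lower() for d in docs]          # pre-processing

class TokenizeStage:
    def transform(self, docs):
        return [d.split() for d in docs]          # feature extraction

class CountFeatures:
    def transform(self, token_lists):
        return [len(tokens) for tokens in token_lists]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:                 # each stage feeds the next
            data = stage.transform(data)
        return data

pipe = Pipeline([LowercaseStage(), TokenizeStage(), CountFeatures()])
# pipe.transform(["Hello Big Data", "Spark"]) -> [3, 1]
```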
Hadoop V1 vs. V2
• YARN has taken over the cluster-management responsibilities from MapReduce
• MapReduce now takes care only of data processing; resource management and scheduling are handled by YARN.
• Dedicated Application
Stack for Hadoop
• Cascading
• Crunch
• Hfactory
• Hunk
• Spring for Hadoop
Forrester Big Data Hadoop Distributions, Q1 2016 Forrester Big Data Hadoop Cloud, Q1 2016
Apache Brooklyn is an application blueprinting and management system which supports a wide range of software and services in the cloud.
Provides scalable directed graphs of data routing, transformation, and system mediation logic.
Paxata Trifacta
https://streamsets.com/
Big Data Landscape
Data Integration On Demand: SnapLogic
TIBCO ActiveMatrix BusinessWorks 6 + Apache Hadoop = Big Data Integration
Big Data Landscape
Data Science: IBM Data Science Experience
• Data Science Experience is a cloud-based development environment for near real-time, high-performance analytics
• Available on IBM Cloud Bluemix platform
• Provides
• 250 curated data sets
• open source tools and a collaborative
workspace like H2O, RStudio, Jupyter
Notebooks on Apache Spark
• in a single security-rich managed
environment.
• Helps data scientists uncover and share meaningful insights with developers, making it easier to rapidly develop applications that are infused with intelligence.
http://datascience.ibm.com/
Big Data Landscape
Data Science: Tamr
http://www.tamr.com/
Big Data Landscape
Machine Learning as A Service
• Open Source
• Accord (Dotnet), Apache Mahout, Apache Samoa, Apache Spark MLlib and MLbase, Apache SystemML, Cloudera Oryx, GoLearn (Go), H2O, Photon ML, Prediction.io, RHadoop, Scikit-learn (Python), Seldon, Shogun (C++), Google TensorFlow, Weka.
• Available as a Service
• Algorithmia, Algorithms.io, Amazon ML, BigML, DataRobot, FICO, Google Prediction
API, HPE Haven OnDemand, IBM’s Watson Analytics, Microsoft Machine Learning
Studio, PurePredictive, Predicsis, Yottamine.
• Examples
• BVA with Microsoft Azure ML
• Quick Review of Amazon Machine Learning
• BigML training Series
• Handling Large Data Sets with Weka: A Look at Hadoop and Predictive Models
Source: ButlerAnalytics
Big Data Landscape
Dataviz Tools
• For Non Developers
• ChartBlocks, Infogram, Plotly, Raw, Visual.ly
• For Developers
• D3.js, Infovis, Leaflet, NVD3, Processing.js, Recline.js, visualize.js
• Chart.js, Chartist.js, Ember Charts, Google Charts, FusionCharts, Highcharts, n3-charts, Sigma JS, Polymaps
• More
• Datavisualization.ch curated list
• ProfitBricks list
• Dedicated libraries are also available for Python, Java, C#, Scala, etc.
[Diagram: Big Data reference architecture — Data Sources → Data Ingestion → Data Lake → Lakeshore & Analytics → Analytics App and Services]
• Data Sources: operational systems (ODS, IoT), existing sources of data (databases, DW, DataMart)
• Ingestion Technologies: Apex, Flink, Flume, Kafka, Amazon Kinesis, Nifi, Samza, Spark, Sqoop, Scribe, Storm, NFS Gateway, etc.
• Processing: batch (Map Reduce), streaming (event stream & micro batch)
• Distributed File System: GlusterFS, HDFS, Amazon S3, MapRFS, ElasticSearch
• Encoding Format: JSON, RCFile, Parquet, ORCFile
• NoSQL Databases: Cassandra, Ceph, DynamoDB, HBase, Hive, Impala, Open Ring, OpenStack Swift, etc.
• Distributions: Cloudera, HortonWorks, MapR, SyncFusion, Amazon EMR, Azure HDInsight, Altiscale, Pachyderm, Qubole, etc.
• Data Science: Dataiku, Datameer, Tamr, R, SaS, Python, RapidMiner, etc.
• Machine Learning: BigML, Mahout, Predicsys, Azure ML, TensorFlow, H2O, etc.
• Data Warehouse: Cassandra, Druid, DynamoDB, MongoDB, Redshift, Google BigQuery, etc.
• BI Tools & Platforms: Qlik, Tableau, Tibco, Jethro, Looker, IBM, SAP, BIME, etc.
• App. Services: Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop, D3.js, Leaflet
Source: Claudine O'Sullivan