
Big data architectures and the data lake

James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
About Me
Microsoft, Big Data Evangelist
In IT for 30 years, worked on many BI and DW projects
Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
Have been a permanent employee, contractor, consultant, and business owner
Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
Blog at JamesSerra.com
Former SQL Server MVP
Author of book Reporting with Microsoft SQL Server 2012
Agenda
Big Data Architectures
Why data lakes?
Top-down vs Bottom-up
Data lake defined
Hadoop as the data lake
Modern Data Warehouse
Federated Querying
Solution in the cloud
SMP vs MPP
Big Data Architectures
Enterprise data warehouse augmentation
Seen when the EDW has been in existence a while and can't handle new data
Cons: not offloading EDW work, can't use existing tools, difficulty understanding data in the data hub

Data hub plus EDW
Data hub is used as temporary staging and refining, no reporting
Cons: data hub is temporary, no reporting/analyzing done with the data hub

All-in-one
Data hub is the total solution, no EDW
Cons: queries are slower, new training for reporting tools, difficulty understanding data, security limitations

Modern Data Warehouse
Evolution of the three previous scenarios; the ultimate goal
Supports future data needs
Data is harmonized and analyzed in the data lake or moved to the EDW for more quality and performance
Why data lakes?
Traditional business analytics process
1. Start with end-user requirements to identify desired reports and analysis
2. Define the corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema (schema-on-write)
5. Create reports. Analyze data
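The five steps above can be sketched as a minimal schema-on-write pipeline. The source rows, target schema, and transform below are hypothetical stand-ins; a real pipeline would use a dedicated ETL tool such as SSIS:

```python
import sqlite3

# Hypothetical source rows extracted from a LOB application.
source_rows = [
    {"id": 1, "name": "Contoso", "revenue": "1200.50", "region": "West"},
    {"id": 2, "name": "Fabrikam", "revenue": "980.00", "region": "East"},
]

# Step 2: the target schema is defined up front (schema-on-write).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, name TEXT, revenue REAL)")

# Step 4: transform to the target schema; fields not required by any
# report (here, 'region') are simply discarded or archived.
for row in source_rows:
    conn.execute(
        "INSERT INTO sales (id, name, revenue) VALUES (?, ?, ?)",
        (row["id"], row["name"], float(row["revenue"])),
    )

# Step 5: the report query works only because the schema was fixed first.
total = conn.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
print(total)  # 2180.5
```

Note that any later question about `region` cannot be answered: that data was thrown away at write time.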

Dedicated ETL tools (e.g., SSIS) move data from LOB applications through the ETL pipeline into a relational store with a defined schema, from which queries produce results
All data not immediately required is discarded or archived
Need to collect any data
Harness the growing and changing nature of data: structured, unstructured, streaming
The challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data: get the right information to the right people at the right time in the right format
The three Vs: volume, variety, velocity
New big data thinking: all data has potential value
Data hoarding: no defined schema; data is stored in its native format
Schema is imposed and transformations are done at query time (schema-on-read); apps and users interpret the data as they see fit
Iterate: gather data from all sources, store indefinitely, analyze, see results
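Schema-on-read can be illustrated with a tiny sketch: raw events are stored in their native format, and structure is imposed only at query time. The JSON events and the ad hoc interpretation below are hypothetical:

```python
import json

# Raw events are stored indefinitely in their native format; no schema
# was defined at write time, and the two records need not even agree.
raw_store = [
    '{"user": "alice", "action": "click", "ms": 120}',
    '{"user": "bob", "action": "scroll"}',
]

# Schema-on-read: each consumer imposes its own structure at query time.
events = [json.loads(line) for line in raw_store]
clicks = [e for e in events if e.get("action") == "click"]
print(len(clicks))  # 1
```

A different consumer could reinterpret the same raw store tomorrow with a different "schema", which is exactly the flexibility the data lake approach trades for up-front modeling.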
Top-down vs Bottom-up
Two Approaches to Information Management for Analytics: Top-Down + Bottom-Up
The top-down path moves from theory to hypothesis to observation to confirmation; the bottom-up path moves from observation to pattern to hypothesis to theory
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?
Data Warehousing Uses a Top-Down Approach
Understand corporate strategy, gather requirements (business and technical), then implement the data warehouse
Reporting & analytics: design, then development
Dimension modelling, then physical design
ETL design, then ETL development
Set up infrastructure, install and tune, connect data sources
The Data Lake Uses a Bottom-Up Approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop: batch queries, interactive queries, real-time analytics, machine learning
Sources range from devices to the data warehouse itself
Data Lake + Data Warehouse: Better Together
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?
Data lake defined
What is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.

A place to store unlimited amounts of data in any format inexpensively, especially for archive
purposes
Allows collection of data that you may or may not use later: just in case
A way to describe any large data pool in which the schema and data requirements are not defined
until the data is queried: just in time or schema on read
Complements EDW and can be seen as a data source for the EDW capturing all data but only
passing relevant data to the EDW
Frees up expensive EDW resources (storage and processing), especially for data refinement
Allows for data exploration to be performed without waiting for the EDW team to model and load
the data (quick user access)
Some processing is better done with Hadoop tools than with ETL tools like SSIS
Easily scalable
Traditional Approaches: current state of a data warehouse
Data sources (OLTP, ERP, CRM, LOB) feed ETL into the data warehouse (star schemas, views, other read-optimized structures), which feeds BI and analytics (emailed or centrally stored Excel reports and dashboards), with monitoring and telemetry throughout
Sources: well manicured, often relational; known and expected data volumes and formats; little to no change
ETL: complex, rigid transformations that required extensive monitoring; transformed historical data into read structures
BI: flat, canned, or multidimensional access to historical data; many reports, multiple versions of the truth; 24-to-48-hour delay
Traditional Approaches: the same data warehouse under pressure
Increasing data volume and non-relational data: increase in variety of data sources, in data volume, and in types of data; pressure on the ingestion engine
Increase in time: complex, rigid transformations can't keep pace; monitoring is abandoned; delay in data, inability to transform volumes or react to new sources; repair, adjust, and redesign ETL
Stale reporting: reports become invalid or unusable; the delay in preserved reports increases; users begin to innovate to relieve starvation
New Approaches: Data Lake Transformation (ELT not ETL)
Data sources (OLTP, ERP, CRM, LOB, non-relational data, future data sources) are extracted and loaded into the data lake (transform on read); a data refinery process transforms relevant data into data sets for the data warehouse (star schemas, views, other read-optimized structures); BI and analytics discover and consume predictive analytics, data sets, and other reports
All data sources are considered; leverages the power of on-prem technologies and the cloud for storage and capture; native formats, streaming data, big data
Extract and load with no/minimal transform; storage of data in near-native format; orchestration becomes possible; streaming data accommodation becomes possible
Refineries transform data on read; produce curated data sets to integrate with traditional warehouses; users discover published data sets/services using familiar tools
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure


Data Lake layers
Raw data layer: raw events are stored for historical reference. Also called the staging layer or landing area
Cleansed data layer: raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize the way files are stored in terms of encoding, format, data types, and content (e.g., strings). Also called the conformed layer
Application data layer: business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g., a DW application or an advanced analysis process). Also called the workspace layer or trusted layer

You still need data governance so your data lake does not turn into a data swamp!
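A minimal sketch of data moving through the three layers. The sample records and cleansing rules are illustrative only, not a prescribed layout:

```python
# Raw data layer: events land exactly as received (historical reference).
raw_layer = ["  ALICE ,2016-01-05", "bob,2016-01-06 "]

# Cleansed data layer: make encoding, format, and types uniform
# (here: trim whitespace, lowercase names, split into typed fields).
cleansed_layer = []
for line in raw_layer:
    name, date = [part.strip() for part in line.strip().split(",")]
    cleansed_layer.append({"name": name.lower(), "date": date})

# Application data layer: business logic applied for a specific consumer,
# e.g. a DW load that only wants visits from January 5th.
application_layer = [r for r in cleansed_layer if r["date"] == "2016-01-05"]
print(application_layer)  # [{'name': 'alice', 'date': '2016-01-05'}]
```

The key design point is that each layer is derived from the one before it, so the raw layer can always be replayed if a cleansing or business rule changes.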
Should I use Hadoop or NoSQL for the data lake?
Most implementations use Hadoop as the data lake because of these benefits:
Open-source software ecosystem that allows for massively parallel computing
No inherent structure (no conversion to JSON needed)
Good for batch processing, large files, volume writes, parallel scans, and sequential access (NoSQL is designed for large-scale OLTP)
Large ecosystem of products
Low cost
Con: performance
Hadoop as the data lake
What is Hadoop?
Distributed, scalable system on commodity hardware
Composed of a few parts:
HDFS: distributed file system
MapReduce: programming model
Core services: YARN, with access via NFS and WebHDFS
Other tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm, Ambari, Falcon
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware; each node contributes both compute and storage
Main players are Hortonworks, Cloudera, and MapR
WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)
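The MapReduce programming model mentioned above can be illustrated in plain Python: map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group. This is a single-process sketch of the model only, not Hadoop itself:

```python
from collections import defaultdict

documents = ["big data on hadoop", "data lake on hadoop"]

# Map: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group; on a real cluster each reducer
# handles a disjoint set of keys in parallel.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts["hadoop"])  # 2
```

Because map and reduce operate on independent records and independent keys, the same program can be spread over thousands of commodity nodes, which is the point of the model.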
Hortonworks Data Platform 2.5

Simply put, Hortonworks ties all of the open-source products together (22 of them)
The real cost of Hadoop

http://www.wintercorp.com/tcod-report/
Use cases using Hadoop and a DW in combination
Bringing islands of Hadoop data together
Archiving data warehouse data to Hadoop (move): Hadoop as cold storage
Exporting relational data to Hadoop (copy): Hadoop as backup/DR, analysis, cloud use
Importing Hadoop data into the data warehouse (copy): Hadoop as staging area, sandbox, data lake
Modern Data Warehouse
Modern Data Warehouse

Think about future needs:


Increasing data volumes
Real-time performance
New data sources and types
Cloud-born data
Multi-platform solution
Hybrid architecture
Modern Data Warehouse: the dream versus the reality
Base Architecture: Big Data Advanced Analytics Pipeline
Stages: data sources, ingest, prepare (normalize, clean, etc.), analyze (stat analysis, ML, etc.), publish (for programmatic consumption, BI/visualization), consume (alerts, operational stats, insights)

Data in motion: near-real-time data analytics pipeline using Azure Stream Analytics. Telemetry flows into Event Hub; Stream Analytics performs real-time analytics; Machine Learning detects anomalies; live stats, anomalies, and aggregates feed a Power BI dashboard

Data at rest: interactive analytics and predictive pipeline using Azure Data Factory. Azure Storage Blob data is aggregated and partitioned by HDI custom ETL, scored with Machine Learning, and predictions land in Azure SQL for a customer MIS dashboard of predictions and alerts; the transfer is scheduled hourly using Azure Data Factory

Big data analytics pipeline using Azure Data Lake: Azure Data Lake Storage feeds Azure Data Lake Analytics (big data processing), with results in Azure SQL for a dashboard of operational stats
Roles when using both Data Lake and DW
Data Lake/Hadoop (staging and processing environment)
Batch reporting
Data refinement/cleaning
ETL workloads
Store historical data
Sandbox for data exploration
One-time reports
Data scientist workloads
Quick results

Data Warehouse/RDBMS (serving and compliance environment)


Low latency
High number of users
Additional security
Large support for tools
Easily create reports (Self-service BI)
A data lake is just a glorified file folder with data files in it. How many end-users can accurately create reports from it?
Microsoft data platform solutions
SQL Server 2016 (RDBMS): Earned the top spot in Gartner's Operational Database Magic Quadrant. JSON support. https://www.microsoft.com/en-us/server-cloud/products/sql-server-2016/
SQL Database (RDBMS/DBaaS): Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support. https://azure.microsoft.com/en-us/services/sql-database/
SQL Data Warehouse (MPP RDBMS/DBaaS): Cloud-based service that handles relational big data. Provision and scale quickly. Can pause the service to reduce cost. https://azure.microsoft.com/en-us/services/sql-data-warehouse/
Analytics Platform System (APS) (MPP RDBMS): Big data analytics appliance for high performance and seamless integration of all your data. https://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/
Azure Data Lake Store (Hadoop storage): Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics. https://azure.microsoft.com/en-us/services/data-lake-store/
Azure Data Lake Analytics (on-demand analytics job service / Big Data-as-a-service): Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U-SQL, a new big data query language. https://azure.microsoft.com/en-us/services/data-lake-analytics/
HDInsight (PaaS Hadoop compute): A managed Apache Hadoop, Spark, R, HBase, and Storm cloud service made easy. https://azure.microsoft.com/en-us/services/hdinsight/
DocumentDB (PaaS NoSQL: document store): Get your apps up and running in hours with a fully managed NoSQL database service that indexes, stores, and queries data using familiar SQL syntax. https://azure.microsoft.com/en-us/services/documentdb/
Azure Table Storage (PaaS NoSQL: key-value store): Store large amounts of semi-structured data in the cloud. https://azure.microsoft.com/en-us/services/storage/tables/
Cortana Intelligence Suite
Integrated as part of an end-to-end suite
Data sources: apps, sensors and devices
Information Management: Data Factory, Data Catalog, Event Hubs
Big Data Stores: Data Lake Store, SQL Data Warehouse
Machine Learning and Analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
Intelligence: Cognitive Services, Bot Framework, Cortana
Dashboards & Visualizations: Power BI
People: web apps, mobile apps, bots, automated systems
Flow: Data, then Intelligence, then Action
Federated Querying
Federated Querying
Other names: data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse

A model that allows a single query to retrieve and combine data as it sits in multiple data sources, so there is no need to use ETL or learn more than one retrieval technology
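The idea can be shown with a toy sketch: one query-like function combines rows from two "systems" in place, with no ETL step copying data into a common store. The two in-memory sources below are hypothetical stand-ins for a relational table and a semi-structured file in a data lake:

```python
# Stand-ins for two separate systems: a relational customer table
# and click events sitting in a data lake file.
relational_customers = [
    {"id": 1, "name": "Contoso"},
    {"id": 2, "name": "Fabrikam"},
]
lake_clicks = [
    {"customer_id": 1, "page": "/home"},
    {"customer_id": 1, "page": "/pricing"},
]

def federated_join():
    # Data is read from each source as it sits; nothing is staged
    # or transformed ahead of time.
    by_id = {c["id"]: c["name"] for c in relational_customers}
    return [(by_id[e["customer_id"]], e["page"]) for e in lake_clicks]

print(len(federated_join()))  # 2
```

A real federated engine (such as PolyBase, discussed next) does this across machines and storage systems, pushing work down to each source where it can.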
PolyBase
Query relational and non-relational data with T-SQL

In a preview early this year, PolyBase will add support for Teradata, Oracle, SQL Server, MongoDB, and generic ODBC (Spark, Hive, Impala, DB2)

Vs U-SQL: PolyBase is interactive while U-SQL is batch. U-SQL requires more code to query data but handles more formats (JSON) and libraries/UDOs, and supports writes to blob/ADLS
Solution in the cloud
Benefits of the cloud
Agility: unlimited elastic scale; pay for what you need
Innovation: quick time to market; fail fast
Risk: availability, reliability, security

Total cost of ownership calculator: https://www.tco.microsoft.com/

Constraints of on-premises data
Scale constrained to on-premises procurement
CapEx up-front costs; most companies instead prefer a yearly operating expense (OpEx)
A staff of employees or consultants must be retained to administer and support the hardware and software in place
Expertise needed for tuning and deployment
Talking points when using the cloud for DW
Public and private cloud
Cloud-born data vs on-prem born data
Transfer cost from/to cloud and on-prem
Sensitive data on-prem, non-sensitive in cloud
Look at hybrid solutions
SMP vs MPP
SMP vs MPP

SMP (Symmetric Multiprocessing)
Multiple CPUs used to complete individual processes simultaneously
All CPUs share the same memory, disks, and network controllers (scale-up)
All SQL Server implementations up until now have been SMP
Mostly, the solution is housed on a shared SAN

MPP (Massively Parallel Processing)
Uses many separate CPUs running in parallel to execute a single program
Shared nothing: each CPU has its own memory and disk (scale-out)
Segments communicate using a high-speed network between nodes
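The shared-nothing idea can be sketched with a partitioned aggregate: each "node" computes a partial result over its own slice of the data, and a control node combines the partials. Here the nodes are simulated with a thread pool; a real MPP appliance distributes both the storage and the compute across machines:

```python
from concurrent.futures import ThreadPoolExecutor

def node_sum(partition):
    # Each node aggregates only the rows it owns (shared nothing).
    return sum(partition)

rows = list(range(1, 101))
# Distribute the rows across four simulated "nodes" (scale-out).
partitions = [rows[i::4] for i in range(4)]

# Each node works on its own partition in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(node_sum, partitions))

# The control node combines the partial aggregates into the final answer.
print(sum(partials))  # 5050
```

Because no memory or disk is shared, adding nodes adds capacity roughly linearly, which is why MPP systems scale out where SMP systems can only scale up.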
DW Scalability Spider Chart
The spiderweb depicts important attributes to consider when evaluating data warehousing options; Big Data support is the newest dimension. MPP offers scalability across dimensions, while SMP is tunable in one dimension at the cost of the other dimensions.
Data volume: 10 TB up to 5 PB
Query concurrency: 100 up to 10,000
Mixed workload: strategic only, up to strategic, tactical, loads, and SLA
Data freshness: weekly load, daily load, near-real-time data feeds
Query complexity: simple aggregation, up to complex WHERE constraints, OLAP operations, 3-5 and 5-10 way joins, and parallelism
Schema sophistication: star, up to multiple integrated stars and normalized
Query freedom: batch reporting and repetitive queries, views, ad hoc queries, data analysis/mining
Query data volume: MBs, GBs, TBs
Summary
We live in an increasingly data-intensive world
Much of the data stored online and analyzed today is more varied than the data stored in recent years
More of our data arrives in near-real time
Data is the new currency!

This presents a large business opportunity. Are you ready for it?
Other Related Presentations
Building a Big Data Solution
Choosing technologies for a big data solution in the cloud
How does Microsoft solve Big Data?
Benefits of the Azure cloud
Should I move my database to the cloud?
Implement SQL Server on an Azure VM
Relational databases vs Non-relational databases
Introduction to Microsoft's Hadoop solution (HDInsight)
Introducing Azure SQL Database
Introducing Azure SQL Data Warehouse
Visit my blog at: JamesSerra.com (where these slide decks are posted under the Presentation tab)
Resources
Why use a data lake? http://bit.ly/1WDy848
Big Data Architectures http://bit.ly/1RBbAbS
The Modern Data Warehouse: http://bit.ly/1xuX4Py
Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
Q&A
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the Presentations tab)
