
Azure Data Factory – Preview of V2

Ian Pike
Technology Solution Professional (Data & AI)
@ianpike
Microsoft Data Platform
All-up Overview
DATA | AI | CLOUD

BENEFITS OF DATA, AI, CLOUD TRANSFORMATION
Nearly double operating margin
$40k more revenue per employee
$100M additional average operating income for the most digitally transformed enterprises
50%+ higher average net income on revenue

Source: Keystone Strategy 2016


THE MODERN DATA ESTATE

HYBRID
On-premises: operational databases, data warehouses, data lakes
Cloud: operational databases, data warehouses, data lakes

Reason over any data, anywhere | Flexibility of choice | Security and privacy



AZURE DATA FACTORY
Fully managed data integration service in the cloud

Flexible data integration | Hybrid data orchestration | Data movement as-a-service | Security and compliance

• Modernize your data warehouse with Azure big data and advanced analytics services such as HDInsight and Data Lake Analytics
• Build custom data-driven SaaS applications unique to your customer data, using your language of choice
• Bring together all your sources of data to understand your customers and drive impactful business decisions
• Orchestrate your data pipeline wherever your data lives – in the cloud or in a self-hosted environment
• Meet your security and compliance needs while taking advantage of truly hybrid integration capabilities
• Execute your SQL Server Integration Services (SSIS) packages in the cloud
• Accelerate integration with managed data movement as-a-service
• Improve your TCO with 30+ natively supported connectors across 18 global points of presence
• Elastic, serverless data movement at scale with no infrastructure to manage
HYBRID DATA INTEGRATION AT SCALE
Data processing & movement across cloud, V-NET and on-premises environments:
• Relational data: OLTP, ERP, CRM, LOB
• Non-relational data: web, media, social media, devices
• Any BI tool: dashboards, reporting, mobile BI, cubes
• Advanced analytics: Machine Learning, Stream Analytics, Cognitive / AI
• Any language: .NET, Java, R, Python, Ruby, PHP, Scala
AZURE DATA FACTORY ORCHESTRATES DATA PIPELINE ACTIVITY WORKFLOW & SCHEDULING
ADF: Cloud-First Data Integration Objectives
Consume hybrid disparate data
On-prem + Cloud
Grow the ADF ecosystem of structured, unstructured, and semi-structured data connectors
Calculate and format data for analytics
Transform, aggregate, join, normalize
Separate data flow (transformation) from control flow (orchestration)
Address large-scale Big Data requirements
Scale-up or Scale-out data movement and transformation
Support multiple processing engines
Operationalize
Support flexible scheduling and triggering mechanism for broad range of use cases
Manage & monitor multiple pipelines (via Azure Monitor & OMS)
Support secure VNET environments
Enable SSIS package execution
Execute SSIS packages in ADF Integration Runtime
ADF: Cloud-First Data Integration Scenarios

Lift and Shift to the Cloud


• Migrate on-prem DW to Azure
• Lift and shift existing on-prem SSIS packages to cloud
• No changes needed to migrate SSIS packages to Cloud service
DW Modernization
• Modernizing the DW architecture to reduce cost and scale to the needs of big data (volume, variety, etc.)
• Flexible wall-clock and triggered event scheduling
• Incremental Data Load
Build Data-Driven, Intelligent SaaS Application
• C#, Python, PowerShell, ARM support
Big Data Analytics
• Customer profiling, Product recommendations, Sentiment Analysis, Churn Analysis, Customized
offers, customer usage tracking, customized marketing
• On-demand Spark cluster support
Load your Data Lake
• Separate control-flow to orchestrate complex patterns with branching, looping, conditional
processing
ADF V2 Improvements
• Integration Runtimes (IR) replace the Data Management Gateway (DMG), providing data movement and activity
dispatch on-premises or in the cloud
• Supports resources within virtual networks
• Integration Runtime includes SSIS option to lift & shift SSIS packages to the Cloud
• Separation of “control flow” & “data flow” capabilities for more flexible pipeline
management
• Looping, conditionals, dependencies, parameters
• Python SDK
• On-Demand Spark support
• Flexible pipeline scheduling with wall-clock and triggered executions
• Expanded use cases: from primarily time-window-oriented pipelines to trigger-based, on-demand execution
for more flexible ETL and data integration orchestrations
• (Coming in GA) A code-free UX for building pipelines and data transformations
New ADF V2 Concepts

Control Flow
Orchestration of pipeline activities: chaining activities in a sequence, branching, conditional branching based on an expression, parameters defined at the pipeline level, and arguments passed while invoking the pipeline on demand or from a trigger. Also includes custom state passing and looping containers, i.e. For-Each and Until iterators.
Sample:
{
  "name": "MyForEachActivityName",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": "true",
    "items": "@pipeline().parameters.mySinkDatasetFolderPathCollection",
    "activities": [
      {
        "name": "MyCopyActivity",
        "type": "Copy",
        "typeProperties": …

Runs
A Run is an instance of a pipeline execution. Pipeline runs are typically instantiated by passing arguments to the parameters defined in the pipeline. The arguments can be passed manually or supplied by the trigger.
Sample:
POST https://management.azure.com/subscriptions/<subId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/<dataFactoryName>/pipelines/<pipelineName>/createRun?api-version=2017-03-01-preview

Activity Logs
Every activity execution in a pipeline generates activity-start and activity-end log events.

Integration Runtime
Replaces the DMG as the way to move and process data in Azure PaaS services, self-hosted on-premises, or on IaaS. Works with VNets and enables SSIS package execution.

Scheduling
Flexible scheduling: wall-clock scheduling and event-based triggers (a PowerShell sketch follows this table).
Sample:
"type": "ScheduleTrigger",
"typeProperties": {
  "recurrence": {
    "frequency": <<Minute, Hour, Day, Week, Year>>,
    "interval": <<int>>,            // optional, how often to fire (defaults to 1)
    "startTime": <<datetime>>,
    "endTime": <<datetime>>,
    "timeZone": <<default UTC>>,
    "schedule": {                   // optional (advanced scheduling specifics)
      "hours": [<<0-24>>],
      "weekDays": [<<Monday-Sunday>>],
      "minutes": [<<0-60>>],
      "monthDays": [<<1-31>>],
      "monthlyOccurrences": [
        {
          "day": <<Monday-Sunday>>,
          "occurrence": <<1-5>>
New ADF V2 Concepts (continued)

On-Demand Execution
Instantiate a pipeline by passing arguments to the parameters defined in the pipeline, and execute it from script / REST / API (see the sketch after this table).
Sample:
Invoke-AzureRmDataFactoryV2PipelineRun -DataFactory $df -PipelineName "Adfv2QuickStartPipeline" -ParameterFile .\PipelineParameters.json

Parameters
Name-value pairs defined in the pipeline. Arguments for the defined parameters are passed during execution from the run context created by a trigger or by a manually executed pipeline; activities within the pipeline consume the parameter values.
A Dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets and consume the properties defined in the dataset definition.
A Linked Service is also a strongly typed parameter containing the connection information to either a data store or a compute environment; it is likewise a reusable/referenceable entity.
Sample (accessing parameters and the output of other activities using expressions):
@parameters("{name of parameter}")
@activity("{Name of Activity}").output.RowsCopied

Incremental Data Loading
Leverage parameters and define your high-water mark for delta copy while moving dimension or reference tables from a relational store, either on-premises or in the cloud, to load the data into the lake.
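
To make the on-demand execution and run concepts concrete, here is a minimal hedged sketch using the preview PowerShell cmdlets; the resource group, factory name, pipeline name and parameter names are illustrative placeholders, not values taken from this deck.

# Invoke a pipeline run with inline parameters and then poll its status; names are assumptions.
$rg  = "MyResourceGroup"
$dfn = "MyDataFactoryName"
$df  = Get-AzureRmDataFactoryV2 -ResourceGroupName $rg -Name $dfn

$runId = Invoke-AzureRmDataFactoryV2PipelineRun -DataFactory $df `
    -PipelineName "Adfv2QuickStartPipeline" `
    -Parameter @{ "inputPath" = "adftutorial/input"; "outputPath" = "adftutorial/output" }

# A Run is an instance of the pipeline execution; check on it by run ID
Get-AzureRmDataFactoryV2PipelineRun -ResourceGroupName $rg -DataFactoryName $dfn `
    -PipelineRunId $runId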
ADF: Basic Concepts
ADF is Microsoft's unified platform for ETL/ELT on Azure
ADF allows you to build data pipelines and schedule them
A data pipeline is a chain/group of activities, such as data movements and transformations
Datasets are named references/pointers to data, used as inputs and outputs
Some activities produce and/or consume datasets
Linked Services represent the connection information for resources

[Diagram: a Dataset (table, file) represents data item(s) stored in a Linked Service (e.g. SQL Server, Hadoop); an Activity (Hive, stored proc, copy) consumes and produces datasets and runs on a Linked Service; a Pipeline is a logical grouping of activities that is scheduled, monitored and managed as a unit.]
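
To make these building blocks concrete, here is a minimal hedged sketch of registering a linked service, a dataset and a pipeline from JSON definition files with the ADF v2 PowerShell cmdlets; all names and file paths are placeholders, and the referenced JSON definition files are assumed to exist.

# Minimal sketch (preview-era AzureRM.DataFactoryV2 cmdlets); names and paths are placeholders.
$rg  = "MyResourceGroup"
$dfn = "MyDataFactoryName"

# Linked Service: connection information for a data store or compute environment
Set-AzureRmDataFactoryV2LinkedService -ResourceGroupName $rg -DataFactoryName $dfn `
    -Name "AzureStorageLinkedService" -DefinitionFile ".\AzureStorageLinkedService.json"

# Dataset: a named reference/pointer to data, used as an activity input/output
Set-AzureRmDataFactoryV2Dataset -ResourceGroupName $rg -DataFactoryName $dfn `
    -Name "BlobDataset" -DefinitionFile ".\BlobDataset.json"

# Pipeline: a logical grouping of activities that is scheduled, monitored and managed together
Set-AzureRmDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $dfn `
    -Name "CopyPipeline" -DefinitionFile ".\CopyPipeline.json"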
ADF: ADF v2 Integration Runtimes
The Integration Runtime (IR) is the ADF compute resource that performs data movements and transformations, including SSIS executions
Customers can deploy one or many instances of the IR
The IR can run in the public network, inside a VNet, or behind a corporate firewall
ADF v2 adds an Azure-SSIS IR so that customers can run their SSIS packages on it

[Diagram: ADF data pipelines provide orchestration and monitoring of process activities; the Integration Runtime (IR) performs data movements, SSIS package executions and data wrangling, dispatching work to Hadoop/Spark, Data Lake Analytics, Machine Learning, SQL stored procedures (sprocs), Azure HDI, DLA, ML, SQL DB/MI/DW, Batch, custom activities, etc.]
ADF: ETL/ELT Services in Azure
Azure SSIS IR is now in Public Preview (25th Sept 2017)
East US/North Europe Azure Data Centres
Six node sizes
Standard SKU/Licence
24/7 live site support

Post Public Preview


More locations
More Node sizes
Enterprise SKU/licence
Support for Custom/3rd party component installations
Azure Resource Manager (ARM) VNet support
ADF: Provisioning Methods
The Azure-SSIS IR can be provisioned via custom code/PowerShell using the ADF v2 .NET SDK/API (see the sketch after this list)

The Azure-SSIS IR will also be provisionable via the ADF v2 portal (work in progress)
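
A minimal hedged sketch of the PowerShell route, using the preview-era AzureRM.DataFactoryV2 cmdlets; the resource group, factory name, IR name, node size/count, Azure SQL server and catalog credentials are all placeholders.

# Provision an Azure-SSIS IR; assumes an existing ADF v2 factory and an Azure SQL DB server to host SSISDB.
$rg  = "MyResourceGroup"
$dfn = "MyDataFactoryName"
$catalogPwd = ConvertTo-SecureString "<ssisdb-admin-password>" -AsPlainText -Force

Set-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $rg -DataFactoryName $dfn `
    -Name "MySsisIR" -Type Managed -Location "EastUS" `
    -NodeSize "Standard_A4_v2" -NodeCount 1 `
    -CatalogServerEndpoint "myserver.database.windows.net" `
    -CatalogAdminCredential (New-Object System.Management.Automation.PSCredential("ssisadmin", $catalogPwd)) `
    -CatalogPricingTier "S0"

# Starting the IR actually provisions the nodes; this can take a while to complete
Start-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $rg -DataFactoryName $dfn `
    -Name "MySsisIR" -Force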


ADF: Deployment Methods
SSDT/SSMS using the Integration Services Deployment Wizard

Command Line Interface (CLI)
Run isdeploymentwizard.exe (see the sketch after this list)

Custom code/PSH using the SSIS Managed Object Model (MOM) .NET SDK/API

T-SQL scripts executing SSISDB sprocs
Execute the SSISDB sproc [catalog].[deploy_project]
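
A hedged sketch of the command-line route; the tool ships with SSDT/SSMS, the switch names assume the documented ISDeploymentWizard options, and the .ispac path, server name and catalog folder are placeholders.

# Silent project deployment to the SSIS catalog; authentication options vary by tool version and target server.
& "ISDeploymentWizard.exe" /Silent `
    '/SourcePath:C:\Projects\MyEtlProject\bin\MyEtlProject.ispac' `
    '/DestinationServer:myssisdbserver.example.net' `
    '/DestinationPath:/SSISDB/MyFolder/MyEtlProject'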
ADF: Execution Methods
SSMS

Command Line
Run dtexec.exe

Custom code/PSH using SSIS MOM .net SDK/API

T-SQL scripts executing SSISDB sprocs (see the sketch after this list)

Execute the SSISDB sprocs [catalog].[create_execution] + [catalog].[set_execution_parameter_value] + [catalog].[start_execution]
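
A minimal hedged sketch of running a deployed package through the SSISDB catalog sprocs named above, driven from PowerShell with Invoke-Sqlcmd (SqlServer module); the server, credentials and folder/project/package names are placeholders.

$tsql = @"
DECLARE @execution_id BIGINT;
EXEC [SSISDB].[catalog].[create_execution]
     @folder_name = N'MyFolder', @project_name = N'MyEtlProject',
     @package_name = N'Package.dtsx', @use32bitruntime = 0,
     @execution_id = @execution_id OUTPUT;
-- Run synchronously (system parameter SYNCHRONIZED, object_type 50)
EXEC [SSISDB].[catalog].[set_execution_parameter_value]
     @execution_id, @object_type = 50,
     @parameter_name = N'SYNCHRONIZED', @parameter_value = 1;
EXEC [SSISDB].[catalog].[start_execution] @execution_id;
"@

Invoke-Sqlcmd -ServerInstance "myserver.database.windows.net" -Database "SSISDB" `
    -Username "ssisadmin" -Password "<password>" -Query $tsql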
ADF: Monitoring Methods
SSMS

Custom code/PSH using the SSIS MOM .NET SDK/API

Show provisioning status, requested/allocated/available nodes, etc. (see the sketch below)

ADF v2 Portal
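
One possible PSH route is the ADF v2 PowerShell module (rather than the SSIS MOM API); a minimal hedged sketch follows, with the resource group, factory and IR names as placeholders.

# -Status returns the detailed IR state, including node allocation for the Azure-SSIS IR
$rg  = "MyResourceGroup"
$dfn = "MyDataFactoryName"

Get-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $rg -DataFactoryName $dfn `
    -Name "MySsisIR" -Status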
ADF: Scheduling Methods
If you keep an on-premises SQL Server
Via the on-premises SQL Server Agent

If you use an Azure SQL DB server to host SSISDB
Via Elastic Jobs (Private Preview)

If you use an Azure SQL MI server to host SSISDB
Via the Azure SQL MI Agent (Extended Private Preview)

If you are using ADF
Via the ADF v1/v2 Sproc activity
Via the ADF v2 SSIS activity (work in progress)
ADF: Pipelines in Portal (Preview)
[Portal screenshots]

ADF: Work in Progress in Portal (Preview)
[Portal screenshots]
Links
Videos
Lift and shift SSIS to the cloud by using Azure Data Factory
https://www.youtube.com/watch?v=Crmi56v_YA8
New capabilities for data integration in the cloud
https://www.youtube.com/watch?v=BQ4K2Uch_dw&t=4789s

Documentation and set-up for demos of SSIS
https://docs.microsoft.com/en-us/azure/data-factory/
https://docs.microsoft.com/en-us/azure/data-factory/tutorial-deploy-ssis-packages-azure
Patterns & Scenarios for ADF V2
Hybrid Data Integration Pattern 1: Analyze blog comments

[Diagram: blog comments are captured via API and dropped into Blob Store; Azure Data Factory copies and looks up the data, transforming it via SPROC (ELT) or via a dataflow (ETL) into Azure SQL Database, where a Power BI dashboard is used to visualize and analyze; a Data Management Gateway is required for ADF to reach the on-premises SQL Server.]
Hybrid Data Integration Pattern 2: Data-driven SaaS app – sentiment analysis with Machine Learning

[Diagram: Azure Data Factory loops through all files in a Blob Storage folder, invoking Azure Functions and a custom app.]
Hybrid Data Integration Pattern 3: Modern Data Warehouse

[Diagram: daily flat files, on-premises OLTP DB tables (SQL/Oracle) and social media data (un/semi-structured) are moved into Azure Blob Storage; Azure Data Factory (PaaS) provides scheduling, pipeline/job monitoring and management, transforming, combining and analyzing the data (including customer call details scored by an AML churn model) into analytical schemas in Azure DW, visualized with Power BI on demand.]
ADFv2: Control Flow
Coordinate pipeline activities into finite execution steps to enable looping, conditionals and chaining, while separating data transformations into individual data flows (see the sketch below).

[Diagram: a trigger (event, wall-clock or on-demand) starts My Pipeline 1 with parameters; Activity 1 chains on success (passing parameters) to Activity 2 and on to Activity 3/Activity 4, or branches to an "On Error" activity on failure; a For Each container invokes My Pipeline 2, whose activities execute on the Integration Runtime.]
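
To ground the chaining/branching idea, here is a minimal hedged sketch of how success/failure dependencies are expressed in an ADF v2 pipeline definition and registered from PowerShell; the pipeline name, activity names, Wait times and file path are illustrative assumptions, not taken from this deck.

# Two-step pipeline where the second activity runs only if the first succeeds (dependencyConditions).
$pipelineJson = @"
{
  "name": "ControlFlowDemoPipeline",
  "properties": {
    "activities": [
      {
        "name": "Step1",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 5 }
      },
      {
        "name": "Step2",
        "type": "Wait",
        "dependsOn": [
          { "activity": "Step1", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": { "waitTimeInSeconds": 5 }
      }
    ]
  }
}
"@
Set-Content -Path ".\ControlFlowDemoPipeline.json" -Value $pipelineJson

Set-AzureRmDataFactoryV2Pipeline -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactoryName" -Name "ControlFlowDemoPipeline" `
    -DefinitionFile ".\ControlFlowDemoPipeline.json"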
ADF Integration Runtime (IR)
• An ADF compute environment with multiple capabilities: activity dispatch & monitoring, data movement, and SSIS package execution
• To integrate data flow and control flow across the enterprise's hybrid cloud, customers can instantiate multiple IR instances for different network environments: on-premises (similar to the DMG in ADF V1), in the public cloud, or inside a VNet
• Brings a consistent provisioning and monitoring experience across those network environments

Self-Hosted IR: data movement & activity dispatch on-premises, in the cloud or in a VNet
Azure IR: data movement & activity dispatch in the Azure public network, plus SSIS (VNet support coming soon)

[Diagram, built up over several slides: the Azure Data Factory service (scheduling | orchestration | monitoring) issues command & control through the data-flow UX & SDK for authoring and monitoring/management; a PaaS-hosted Integration Runtime in the Azure cloud reaches cloud apps, services & data, while an installable-agent Integration Runtime reaches on-premises apps & data; the IR handles activity dispatch/monitoring (Spark, copy, ML, etc.), data movement and SSIS package execution.]
Example: Azure Data Factory "Integration Runtime" deployed on-premises for transformation, with data then moved to the cloud (Azure US West)

[Diagram: HP Inc (Customer 1, global). The on-premises IR sits inside the customer's production firewall and communicates outbound over port 443, across the public internet border, with the Data Factory orchestration micro-service; data from the on-premises SQL Data Warehouse, the Hadoop cluster and the UAM server is transformed locally and moved to Azure Storage.]
Example: Azure Data Factory "Integration Runtime" deployed inside a VNet (Azure US West)

[Diagram: HP Inc (Customer 2, global). The Data Factory orchestration micro-service communicates over port 443, across the public internet border, with an IR running inside the customer's VNet; the IR loads Azure Storage and SQL Data Warehouse, and reaches the on-premises Hadoop cluster and UAM server behind the production firewall over an ExpressRoute private circuit.]
SSIS in ADF
SSIS as an Integration Runtime
• Provision the SSIS IR via Data Factory as an ADF compute resource
• Use SQL Server Data Tools (SSDT) to author/deploy SSIS packages
• Use SQL Server Management Studio (SSMS) to execute/manage SSIS packages
• Targets SSIS customers who want to move all or part of their on-premises workloads and just "lift & shift" many existing packages to Azure
• Independent Software Vendors (ISVs) can build extensions / Software as a Service (SaaS) on SSIS Everest

[Diagram: Azure Data Factory V2 provisions the SSIS IR in the cloud alongside HDI and ML compute; SSDT is used to author/deploy and SSMS to execute/manage packages against the SSIS server, whether on-premises or in the cloud.]
Traditional ETL
1. SSIS on-prem (to SQL Server)
• Lift & shift with compatibility
2. SSIS on IaaS (to SQL on IaaS | Azure DB)
• Want PaaS benefits (scale, no VM management, etc.)
• Mix reuse & modernization
3. SSIS Runtime in ADF
Trigger → SSIS Package A                (SSIS package "as-is")
Trigger → Copy → SSIS Pkg A → AML → …   (SSIS package "reuse")
Trigger → Copy → Spark → AML → …        (transform "@scale")

Modern DW & Data-Driven SaaS
1. ADF V1 (time-series, tumbling window) – scenarios: log processing, etc.
2. ADF V2 (PaaS: control flow, data flow)
Migrate v1 to v2:
• Much more flexible model
• Similar or cheaper in many scenarios
ADF V2 Pricing (Preview Prices)

Orchestration
Units: activity runs. An activity run is a single run or re-run of an activity or trigger.
Where the activity is orchestrated | Price
Azure Integration Runtime (public network environment) | 0–50K runs: $0.55 / 1K runs; 50K+ runs: $0.50 / 1K runs
Self-Hosted Integration Runtime (private network environment) | $0.75 / 1K runs

Data Movement
Units: DMU. A DMU is the unit of measure used to copy data from source to sink.
Where the activity is orchestrated | Price
Azure Integration Runtime (public network environment) | $0.125 / hr
Self-Hosted Integration Runtime (private network environment) | $0.05 / hr

Inactive pipelines (pipelines not associated with a trigger and with zero runs for a week): $0.20 / week

Azure Integration Runtime for SSIS
VM | Standard Edition $/hr | Enterprise Edition $/hr
A4 v2 (4 core) | $0.420 | $0.956
A8 v2 (8 core) | $0.862 | $1.935
D1 v2 (1 core) | $0.296 | $0.832
D2 v2 (2 core) | $0.397 | $0.933
D3 v2 (4 core) | $0.599 | $1.136
D4 v2 (8 core) | $1.199 | $2.271

The ADF Azure-SSIS IR is a fully managed service for executing your SSIS packages. You can specify the VM type and the number of nodes for your dedicated IR environment. Note that you have to pay separately for an Azure SQL DB instance to host the SSIS catalog in the cloud, and if you use Azure Data Factory to move data outside the Azure network you also pay for the data egress.
These VM prices are discounted from the list prices of SQL Server VMs. Enterprise SKUs for SSIS will not be available during public preview.
Scenario 1: Data movement across different data sources
Suppose you have a pipeline that has:

1. Copy Activity 1 moves data from On-Premises SQL Server to Azure Blob to be consumed later by a
customized application
2. Copy Activity 2 moves data from Azure Blob to Azure SQL Database

Assume that the copy from on-premises SQL Server to Azure Blob takes 2 hours, and the copy from Azure Blob to
Azure SQL Database takes 1 hour with 2 DMUs. The pipeline is executed 30 times in a month.

Copy from on-premises SQL Server to Azure Blob:
Data movement: 30 activity runs × 2 hours per run × $0.05/hr = $3.00
Activity runs on the Self-Hosted Integration Runtime (30 runs): + $0.75
Subtotal = $3.75

Copy from Azure Blob to Azure SQL Database:
Data movement: 30 activity runs × 1 hour per run × 2 DMUs × $0.125 = $7.50
Activity runs on the Azure Integration Runtime (30 runs): + $0.55
Subtotal = $8.05

Total price (per month): $3.75 + $8.05 = $11.80


Scenario 2: Orchestrating data transformation activities
on compute services
Suppose you have a pipeline that has:

1. Spark Activity to run Spark application on an Azure HDInsight cluster in Azure Virtual Network
2. Azure Data Lake Analytics U-SQL Activity to execute U-SQL on Azure Data Lake Analytics

Assume the pipeline is executed 30 times in one month.

Spark Activity:
Data movement: $0
Activity runs on the Self-Hosted Integration Runtime (30 runs): + $0.75
Subtotal = $0.75

Azure Data Lake Analytics U-SQL Activity:
Data movement: $0
Activity runs on the Azure Integration Runtime (30 runs): + $0.55
Subtotal = $0.55

Total price (per month): $1.30


Scenario 3: Lift & Shift SSIS Packages
Suppose you have set up SQL Server Standard and SSIS on a local 4-core physical machine to run your daily ETL workload. Your ETL workload
includes 100 SSIS packages, each of which takes about 5 minutes to run. These packages are executed once daily, and it takes about 8
hours each day to complete all of the packages.

In order to lift and shift your SSIS package execution to Azure Data Factory, you need to provision the dedicated SSIS Integration Runtime via the Azure
Data Factory portal.

During provisioning, you need to select the VM type that most closely matches the compute power you use on-premises. In this example we choose the
A4 v2 type and 1 node. You are also required to create an Azure SQL Database instance to host the SSIS
catalog on Azure.

We choose the S0 Standard tier for the SSIS catalog in this example. The SSIS Integration Runtime is a dedicated pool, so the compute resource is dedicated only to you (not shared with other users). You can keep the dedicated Integration Runtime pool available 24/7, or you can spin it up and down around your execution window (see the sketch below).

Integration Runtime dedicated 24/7:
SSIS Integration Runtime – A4 v2: $0.42/hr × 1 node × 24 hours/day × 31 days/month
SSIS Integration Runtime 24/7 cost = $312.48 / month
SQL Azure for SSIS catalog – S0: $0.0202/hr × 24 hours/day × 31 days/month
SQL Azure for SSIS catalog cost = $15.03 / month
Total monthly cost to run the SSIS ETL workload = $327.51 / month

Integration Runtime dedicated for the execution time only:
SSIS Integration Runtime – A4 v2: $0.42/hr × 1 node × 8 hours/day × 31 days/month
SSIS Integration Runtime cost = $104.16 / month
SQL Azure for SSIS catalog – S0: $0.0202/hr × 24 hours/day × 31 days/month
SQL Azure for SSIS catalog cost = $15.03 / month
ADF to orchestrate the start/stop of the SSIS Integration Runtime: 2 custom activities (start/stop IR) = 2 activity runs/day × 31 days/month
Total ADF activity cost = $0.55 / month (1–1,000 runs)
Total monthly cost to run the SSIS ETL workload = $119.74 / month
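
The cheaper column relies on spinning the Azure-SSIS IR up only around the execution window; a minimal hedged sketch of the start/stop calls (which the deck suggests orchestrating with two custom ADF activities) is shown below, with the resource group, factory and IR names as placeholders.

# Start the Azure-SSIS IR before the nightly ETL window and stop it afterwards to pay only for execution time.
$rg  = "MyResourceGroup"
$dfn = "MyDataFactoryName"

Start-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $rg -DataFactoryName $dfn `
    -Name "MySsisIR" -Force

# ... run the SSIS packages (e.g. via the SSISDB sprocs or SSMS) ...

Stop-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $rg -DataFactoryName $dfn `
    -Name "MySsisIR" -Force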
