
Scalable ETL with Talend and Hadoop

Talend, Global Leader in Open Source Integration Solutions

Cédric Carbone, Talend CTO


Twitter : @carbone
ccarbone@talend.com
Why talk about ETL with Hadoop?

Hadoop is complex
BI consultants don't have the skills to work with Hadoop
The biggest issue in Hadoop projects is finding skilled Hadoop engineers

An ETL tool like Talend can help democratize Hadoop


Trying to get from this
to this

Why Talend

Talend generates code that is executed within map reduce. This open
approach removes the limitation of a proprietary engine to provide a truly
unique and powerful set of tools for big data.
BIG Data Management

Big Data Production: RDBMS, Analytical DB, NoSQL DB, ERP/CRM, SaaS, Social Media, Web Analytics, Log Files, RFID, Call Data Records, Sensors, Machine-Generated
Big Data Management:
  Big Data Integration: Storage, Processing, Filtering
  Big Data Quality: Parsing, Checking, Enrichment
Big Data Consumption: Mining, Analytics, Search

Turn Big Data into actionable information

Two methods for inserting data quality into a big data job

1. Pipelining: as part of the load process

2. Load the cluster, then implement and execute a data quality MapReduce job


E-T-L: Extract - Transform - Load

E-DQ-L: Extract - Improve/Cleanse - Load
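
The two methods can be contrasted with a minimal Python sketch. It is an illustration under stated assumptions, not Talend-generated code: the `cleanse` rule, the function names and the in-memory list standing in for the cluster are all hypothetical.

```python
def cleanse(record):
    # Hypothetical DQ rule: trim whitespace and normalize capitalization.
    return {"name": record["name"].strip().title()}


def load_with_pipelining(source_records, cluster):
    """Method 1 (E-DQ-L): cleanse each record on its way into the cluster."""
    for record in source_records:
        cluster.append(cleanse(record))


def load_then_improve(source_records, cluster):
    """Method 2: load raw records first, then run a separate DQ job."""
    cluster.extend(source_records)  # load step: data arrives as-is
    # Later DQ pass over the stored data (stands in for a MapReduce job).
    cluster[:] = [cleanse(r) for r in cluster]


raw = [{"name": "  alice SMITH "}, {"name": "bob jones"}]
pipelined, improved_later = [], []
load_with_pipelining(raw, pipelined)
load_then_improve(raw, improved_later)
```

Both lists end up with the same cleansed records; what differs is where the work runs. With pipelining it runs during the load, in the second method it runs inside the cluster after the raw data has landed.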
Pipelining: data quality with big data

Sources: CRM, ERP, Finance, Social Networking, Mobile Devices
Flow: each source -> DQ -> Big Data
Use traditional data quality tools
Once and done
Big data alternative: Load and improve within the cluster

Sources: CRM, ERP, Finance, Social Networking, Mobile Devices
Flow: each source -> Big Data, with DQ applied inside the cluster
Load first, improve later
Complex matching cannot be done outside the cluster
One key DQ rule: Match
Find duplicates within Hadoop
Today's matching algorithms are processor-intensive

Tomorrow's matching algorithms could be more precise, more intensive
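
To make the cost of matching concrete, here is a small Python sketch of duplicate detection written in a map/reduce style. It is an assumption-laden illustration, not the matching algorithm Talend ships: the blocking rule (same postal code), the similarity measure (`difflib.SequenceMatcher`) and the 0.8 threshold are all hypothetical choices.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking rule: only records sharing a postal code are compared.
    return record["zip"]


def map_phase(records):
    # Map: emit (blocking key, record) pairs.
    for record in records:
        yield blocking_key(record), record


def reduce_phase(grouped, threshold):
    # Reduce: compare every pair inside a block; this pairwise step is the costly part.
    for block in grouped.values():
        for a, b in combinations(block, 2):
            score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if score >= threshold:
                yield a["id"], b["id"]


def find_duplicates(records, threshold=0.8):
    grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
    for key, record in map_phase(records):
        grouped[key].append(record)
    return sorted(reduce_phase(grouped, threshold))


customers = [
    {"id": 1, "name": "Jon Smith", "zip": "94022"},
    {"id": 2, "name": "John Smith", "zip": "94022"},
    {"id": 3, "name": "Maria Lopez", "zip": "94022"},
    {"id": 4, "name": "John Smith", "zip": "75001"},
]
duplicates = find_duplicates(customers)
```

Records 1 and 2 are reported as a likely duplicate pair. Record 4 is never compared with them because it falls in a different block, which shows the trade-off of blocking: fewer comparisons, at the risk of missing matches across blocks.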
What is Hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

Java framework for storing data and running data transformations on large clusters of commodity hardware

Licensed under the Apache v2 license

Based on Google's MapReduce, BigTable and Google File System (GFS) papers
Hadoop ecosystem

Sqoop
Pig
Hive
Oozie
HCatalog
Mahout

HBase
Talend for Big Data: Hadoop Story

4.0 [April 2010]: Put or get data into Hadoop through HDFS connectors
4.1 [Oct 2010]: Hadoop query (Hive); bulk load & fast export to Hadoop (Sqoop)
4.2 [May 2011]: Transformation (Pig)
5.0 [Nov 2011]: HBase NoSQL; extend our tPig*
5.1 [May 2012], 5.2 [Oct 2012], 5.3 [June 2013]:
  Metadata (HCatalog)
  Deployment & Scheduling (Oozie)
  Data Lineage & Impact Analysis
  Native MapReduce Code Gen
  Embedded into HDP
  Visual Pig mapping
  Visual ELT mapping (Hive)
  Machine Learning (Mahout)

HCatalog



Democratizing Integration with Data Integration tools for Big Data
WordCount

Comes with Hadoop
First demo that everyone tries!

How-to in Talend Big Data


Simple read, count, load results
No coding, just drag-n-drop
Runs remotely
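
For readers who have not seen WordCount, the read, count and load steps can be sketched in plain Python. This is only an illustration of the map, shuffle and reduce phases, not the Java example that ships with Hadoop and not the code Talend generates; all function names are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    # Map: emit (word, 1) for every word in an input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(word, counts):
    # Reduce: sum the values collected for one word.
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["Hadoop is complex", "Talend generates code for Hadoop"])
```

On a real cluster the map and reduce functions run in parallel on the data nodes and the shuffle moves data across the network; here everything runs in one process to keep the logic visible.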
Map Reduce

[Diagram: map and reduce tasks running on Data Node 1 and Data Node 2]
In Java
WordCount: How-to with a Graphical ETL
Thank You!



Let us show you
Talend Open Studio
Generate Pure Map Reduce
Pig Latin Script generation

FOREACH tPigMap_1_out1_RESULT GENERATE $4 AS Revenu, $6 AS Label
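
The generated statement above is a projection: for each input tuple it keeps two fields, addressed by position, and names them. Pig's positional references are zero-based, so `$4` is the fifth field and `$6` the seventh. A rough Python equivalent, using made-up sample rows, might look like this:

```python
def foreach_generate(rows):
    # Mirrors: FOREACH ... GENERATE $4 AS Revenu, $6 AS Label
    # Pig positional references are zero-based, so $4 is index 4.
    for row in rows:
        yield {"Revenu": row[4], "Label": row[6]}


# Hypothetical input tuples with seven fields each.
sample_rows = [
    ("c1", "Paris", "FR", 2013, 1200.0, "x", "Gold"),
    ("c2", "Los Altos", "US", 2013, 800.0, "y", "Silver"),
]
projected = list(foreach_generate(sample_rows))
```

The field names `Revenu` and `Label` come from the generated script; the sample data and the function name are assumptions for illustration.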
HiveQL generation
HDFS Management and Sqoop
Apache Mahout

Big Data can also be a blob of data to an organization
Apache Mahout provides algorithms for understanding data (data mining)
You don't know what you don't know, and Mahout will tell you.
Metadata Management

Centralized metadata repository for Hadoop cluster, HDFS, Hive
Versioning
Impact Analysis and Data Lineage

HCatalog across HDFS, Hive, Pig
Thank You!



Choose your Hadoop distro

Widely adopted
Management tooling is not OSS

Fully OpenSource
Strong Developer ecosystem

More proprietary
GTM partner with AWS

Many more are coming


Choose your Hadoop distro

Provide tooling:
For installation
For server monitoring

But
No GUI for parsing, transforming, easily loading. No data management
Parse and Standardize

Big Data is not always structured

Correct big data so that data conforms to the same rules
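
As a sketch of what parsing and standardizing can mean in practice, the Python below turns loosely formatted input lines into records that follow one set of rules. The input layout (name; phone; date), the accepted date formats and the target formats are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Hypothetical list of date formats accepted on input.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def standardize_date(text):
    # Try each accepted format and emit ISO 8601 (YYYY-MM-DD); None if unparseable.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def standardize_phone(text):
    # Keep digits only.
    return re.sub(r"\D", "", text)


def parse_and_standardize(line):
    # Parse "name ; phone ; date" and apply the same rules to every record.
    name, phone, date = (part.strip() for part in line.split(";"))
    return {
        "name": " ".join(name.split()).title(),
        "phone": standardize_phone(phone),
        "date": standardize_date(date),
    }


record_a = parse_and_standardize("  alice   SMITH ; (650) 555-0100 ; 01/06/2013")
record_b = parse_and_standardize("Bob Jones;650.555.0199;2013/06/02")
```

Values that cannot be parsed are returned as `None` here rather than guessed, so a later data quality step can count and review them.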
Profile & Monitor DQ
Implications for Integration

DATA
Challenge: Information explosion increases complexity of integration and requires governance to maintain data quality

Requirement: Information processing must scale
Implications for Integration

APPLICATION
Challenge: Brittle, point-to-point connections cannot adapt to evolving business requirements, new channels, and quickly changing topologies

Requirement: Application architecture must scale

Implications for Integration

PROCESS
Challenge: Competitive market forces drive frequent process changes and increased process complexity

Requirement: Business processes must scale

Implications for Integration

Challenge: Interdependencies across data, applications and processes require more resources and budget

Requirement: Resources and skillsets must scale


Integration at Any Scale

True scalability for
Any integration challenge
Any data volume
Any project size

Enables integration convergence

Technology that Scales

Java, SQL, Camel, MapReduce

STANDARDS-BASED CODE GENERATOR

Easy to learn, flexible to adopt, reduces vendor lock-in
No black-box engine means faster maintenance and deployment with improved quality

The engine for Big Data is Hadoop, making it uniquely able to run at infinite scale.
Technology Continuum

ELT (SQL CodeGen): Google BigQuery, Teradata, Netezza, Vertica, Amazon Redshift
Hive/Pig/MR CodeGen: Visual Wizard, HDFS, Sqoop, Oozie
ETL: Java Code, Partitioning, Parallelization
NoSQL: MongoDB, Neo4J, Cassandra, HBase
Talend Overview
Talend today
At a glance
Founded in 2005
Offers highly scalable integration solutions addressing Data Integration, Data Quality, MDM, ESB and BPM
Provides:
  Subscriptions including 24/7 support and indemnification
  Worldwide training and services
Recognized as the open source leader in each of its market categories
400 employees in 7 countries with dual HQ in Los Altos, CA and Paris, France
Over 4,000 paying customers across different industry verticals and company sizes
Backed by Silver Lake Sumeru, Balderton Capital and Idinvest Partners

High growth through a proven model
Brand Awareness: 20 million downloads
Adoption: 1,000,000 users
Market Momentum: +50 new customers / month
Monetization: 4,000 customers
Talend's Unique Integration Solution

Best-of-Breed Solutions: Data Quality, Data Integration, MDM, ESB, BPM
+
Talend Unified Platform:
  Studio: comprehensive Eclipse-based user interface
  Repository: consolidated metadata & project information
  Deployment: web-based deployment & scheduling
  Execution: same container for batch processing, message routing & services
  Monitoring: single web-based monitoring console
=
Talend Unique Integration Solution

Benefits: reduce costs, eliminate risk, reuse skills, economies of scale, incremental adoption

Recognized as the open source leader in each of its market categories by all industry analysts
Solutions that Scale

Big Data, Data Quality, Data Integration, MDM, ESB, BPM

TALEND UNIFIED PLATFORM

Studio, Repository, Deployment, Execution, Monitoring

UNIFIED PLATFORM
A shared foundation and toolset increases resource reuse

CONVERGED INTEGRATION
Use for any data, application and process project
The 6 Dimensions of BIG Data

Primary challenges
Volume
Velocity
Variety

And also
Complexity
Validation
Lineage
What Is BIG Data?

"Big data" is information of extreme size, diversity, complexity and need for rapid processing.
Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011

3,500 tweets per second (June 2011)
1,000,000 transactions per day at Walmart
200 billion intelligent devices: 200,000,000,000 (2015)
275 exabytes of data flowing over the Internet each day: 275,000,000,000,000,000,000 (2020)
What is Big Data?
How to define what big data is
Hans Rosling uses big data to analyze world health trends

Key Takeaway #1
volume, variety, velocity
Traditional Data Flows

Sources: CRM, ERP, Finance
Flow: sources -> ETL / Data Quality -> Normalized Data -> Traditional Data Warehouse
Consumers: Business Analyst, Business User, Warehouse Administrator, Executives
Scheduled daily or weekly, sometimes more frequently.
Volumes rarely exceed terabytes
The new world of big data

Sources: Social Networking, CRM, Mobile Devices, ERP, Transactions, Finance, Network Devices, Sensors
All sources feed Big Data
Key Takeaway #2

Forces us to think differently
Data driven business

data governance --enables--> information --supports--> decisions --drives--> your business

Information provides value to the business
If you can't rely on your information then the result can be missed opportunities, or higher costs.
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
BIG data driven business

BIG data governance --enables--> BIG information --supports--> BIG decisions --drives--> BIG business

Information provides value to the business
If you can't rely on your information then the result can be missed opportunities, or higher costs.
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
How is big data integration being used?

Use Cases
Recommendation Engine
Sentiment Analysis
Risk Modeling
Fraud Detection
Behavior Analysis
Marketing Campaign Analysis
Customer Churn Analysis
Social Graph Analysis
Customer Experience Analytics
Network Monitoring

BUT: to what level is DQ required for your use case?
Key Takeaway #3

Define your use case
