Hadoop is complex
BI consultants don't have the skills to manipulate Hadoop
The biggest issue in a Hadoop project is finding skilled Hadoop engineers
Why Talend
Talend generates code that is executed within MapReduce. This open
approach removes the limitation of a proprietary engine and provides a
powerful set of tools for big data.
BIG Data Management
Big Data
Big Data Management
Production
E-T-L: Extract - Transform - Load
E-DQ-L: Extract - Improve/Cleanse - Load
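The E-DQ-L flow above can be illustrated with a minimal sketch in plain Python. This is not Talend-generated code: the function names, the sample records, and the cleansing rules (trim whitespace, lowercase the email, reject records with no id) are all assumptions chosen for illustration.

```python
def extract(raw_rows):
    """Extract: yield a copy of each record from the source (an in-memory list here)."""
    for row in raw_rows:
        yield dict(row)


def cleanse(record):
    """DQ step: trim whitespace, lowercase the email, reject records with no id."""
    cleaned = {key: value.strip() if isinstance(value, str) else value
               for key, value in record.items()}
    if not cleaned.get("id"):
        return None  # fails the hypothetical completeness rule
    if isinstance(cleaned.get("email"), str):
        cleaned["email"] = cleaned["email"].lower()
    return cleaned


def load(records, target):
    """Load: append the cleansed records to the target store (a list here)."""
    target.extend(records)
    return target


def run_e_dq_l(raw_rows):
    """Run Extract, then Improve/Cleanse, then Load."""
    cleansed = (cleanse(record) for record in extract(raw_rows))
    return load([record for record in cleansed if record is not None], [])


# Hypothetical sample data.
SAMPLE_ROWS = [
    {"id": "1", "name": "  Ada Lovelace ", "email": "ADA@Example.com"},
    {"id": " ", "name": "No Id", "email": "x@example.com"},
]
RESULT = run_e_dq_l(SAMPLE_ROWS)
```

Only the first sample record survives the cleanse step; the second is rejected because its id is blank after trimming.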
Pipelining: data quality with big data
[Diagram: CRM, ERP, Finance, Social Networking, and Mobile Devices feed Big Data, with DQ steps on the source flows]
Use traditional data quality tools
DQ once and done
Big data alternative: Load and improve within the cluster
[Diagram: CRM, ERP, Finance, Social Networking, and Mobile Devices load into Big Data, with DQ steps shown]
Load first, improve later
Complex matching cannot be done outside
One key DQ rule: Match
Find duplicates within Hadoop
Today's matching algorithms are processor-intensive
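To see why matching is processor-intensive, consider a minimal single-machine sketch in Python. It is not Talend's matching algorithm and does not run inside Hadoop: the blocking rule, the 0.85 threshold, the record layout, and the use of the standard library's `difflib.SequenceMatcher` are assumptions for illustration. Comparing every pair of n records costs n(n-1)/2 comparisons, so the sketch only compares records that share a blocking key.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def pair_count(n):
    """Number of pairwise comparisons needed to match n records naively."""
    return n * (n - 1) // 2


def blocking_key(record):
    """Hypothetical blocking rule: first letter of the surname, lowercased."""
    return record["surname"].strip().lower()[:1]


def similarity(left, right):
    """String similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def find_duplicates(records, threshold=0.85):
    """Return id pairs whose full names are at least `threshold` similar."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    matches = []
    for block in blocks.values():
        # Only records in the same block are compared with each other.
        for left, right in combinations(block, 2):
            left_name = f"{left['first']} {left['surname']}"
            right_name = f"{right['first']} {right['surname']}"
            if similarity(left_name, right_name) >= threshold:
                matches.append((left["id"], right["id"]))
    return matches


# Hypothetical sample data.
RECORDS = [
    {"id": "1", "first": "Jon", "surname": "Smith"},
    {"id": "2", "first": "John", "surname": "Smith"},
    {"id": "3", "first": "Mary", "surname": "Jones"},
    {"id": "4", "first": "Jane", "surname": "Smyth"},
]
DUPLICATES = find_duplicates(RECORDS)
```

With one million records, the naive approach needs `pair_count(1_000_000)` comparisons, roughly 5 x 10^11, which is why this kind of work benefits from being distributed across a cluster.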
Sqoop
Pig
Hive
Oozie
HCatalog
Mahout
HBase
Talend for Big Data: Hadoop Story
HCatalog
Democratizing integration with data integration tools for Big Data
WordCount
Comes with Hadoop
First demo that everyone tries!
[Diagram: Data Node 1 and Data Node 2 running MapReduce tasks]
In Java
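The Java listing from this slide is not reproduced here. As a minimal sketch of the same map, shuffle, and reduce logic, the following plain Python runs on one machine and uses no Hadoop API; the function names and the sample input are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


# Hypothetical input: each string stands for one line of a file in HDFS.
COUNTS = word_count(["big data big cluster", "Big data"])
```

In a real cluster the map calls run in parallel on the nodes holding the data and the framework performs the shuffle; here all three steps run sequentially in one process.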
WordCount: How-to with a graphical ETL
Thank You!
Let us show you:
Talend Open Studio
Generate pure MapReduce
Pig Latin script generation
HCatalog across HDFS, Hive, Pig
Choose your Hadoop distro
Widely adopted
Management tooling is not OSS
Fully open source
Strong developer ecosystem
More proprietary
GTM partner with AWS
Provide tooling for installation
But no GUI for parsing, transforming, or easily loading data
No data management
Parse and Standardize
DATA
Challenge: Information explosion increases complexity of integration and requires governance to maintain data quality
Requirement: Information processing must scale
Implications for Integration
APPLICATION
PROCESS
True scalability for:
Any integration challenge
Any data volume
Any project size
Enables integration convergence
Technology that Scales
MapReduce
Java
SQL
Camel
ELT (SQL CodeGen)
Teradata
Netezza
Vertica
Visual Wizard
Hive/Pig/MR CodeGen
ETL
Java Code HDFS, Sqoop, Oozie
Partitioning
Parallelization
NoSQL
MongoDB
Neo4j
Cassandra
HBase
Amazon Redshift
Talend Overview
Talend today
At a glance
Founded in 2005
400 employees in 7 countries with dual HQ in Los Altos, CA and Paris, France
Over 4,000 paying customers across different industry verticals and company sizes
Offers highly scalable integration solutions addressing Data Integration, Data Quality, MDM, ESB and BPM
Backed by Silver Lake Sumeru, Balderton Capital and Idinvest Partners
Monetization
4,000 Customers
Talend's Unique Integration Solution
Best-of-Breed Solutions: Data Quality, Data Integration, MDM, ESB, BPM
+
Talend Unified Platform: Studio, Repository, Deployment, Execution, Monitoring
1 Studio: comprehensive Eclipse-based user interface
2 Repository: consolidated metadata & project information
3 Deployment: web-based deployment & scheduling
4 Execution: same container for batch processing, message routing & services
5 Monitoring: single web-based monitoring console
Unified Platform = Unique Integration Solution
Reduce costs
Eliminate risk
Reuse skills
Economies of scale
Incremental adoption
Recognized as the open source leader in each of its market categories by all industry analysts
Solutions that Scale
Data
UNIFIED PLATFORM
A shared foundation and toolset increases resource reuse
CONVERGED INTEGRATION
Use for any data, application and process project
The 6 Dimensions of BIG Data
Primary challenges: Volume, Velocity, Variety
And also: Complexity, Validation, Lineage
What Is BIG Data?
"Big data" is information of extreme size, diversity, complexity and need for rapid processing.
Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011
3,500 tweets per second (June 2011)
1,000,000 transactions per day at Walmart
200 billion intelligent devices: 200,000,000,000 (2015)
275 exabytes of data flowing over the Internet each day: 275,000,000,000,000,000,000 (2020)
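The long figures on this slide follow from simple unit arithmetic, sketched below in Python. The sketch assumes decimal (SI) units, where one exabyte is 10^18 bytes; the per-day tweet total is derived here from the slide's per-second rate and does not appear in the original.

```python
EXABYTE = 10 ** 18              # bytes in one decimal (SI) exabyte
BILLION = 10 ** 9
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# 275 exabytes per day, written out in bytes.
bytes_per_day = 275 * EXABYTE

# 200 billion intelligent devices, written out in full.
intelligent_devices = 200 * BILLION

# Derived figure (an assumption: the slide's rate held constant for a day).
tweets_per_day = 3_500 * SECONDS_PER_DAY

print(f"{bytes_per_day:,}")
print(f"{intelligent_devices:,}")
print(f"{tweets_per_day:,}")
```

The first two values match the written-out numbers on the slide.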
What is Big Data?
How to define what big data is.
Hans Rosling uses big data to analyze world health trends
Key Takeaway #1: volume, variety, velocity
Traditional Data Flows
CRM
Finance
Scheduled: daily or weekly, sometimes more frequently.
Volumes rarely exceed terabytes
Business Analyst
Business User
Warehouse Administrator
Executives
The new world of big data
[Diagram, built up over three slides: Social Networking, CRM, Mobile Devices, ERP, Transactions, Finance, Network Devices, and Sensors all feed Big Data]
Key Takeaway #2: Forces us to think differently
Data driven business
[Diagram: cycle linking your business, data governance, and information decisions, with the labels "enables", "supports", and "drives"]
Information provides value to the business
"If you can't rely on your information then the result can be missed opportunities, or higher costs."
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
BIG data driven business
[Diagram: the same cycle with BIG business, BIG data governance, BIG information, and BIG decisions, with the labels "enables", "supports", and "drives"]
Information provides value to the business
"If you can't rely on your information then the result can be missed opportunities, or higher costs."
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
How is big data integration being used?
Use Cases
Recommendation Engine
Sentiment Analysis
Risk Modeling
Fraud Detection
Behavior Analysis
Marketing Campaign Analysis
Customer Churn Analysis
Social Graph Analysis
Customer Experience Analytics
Network Monitoring
BUT: to what level is DQ required for your use case?
Key Takeaway #3