
HANDLING BIG DATA
The CloverETL Cluster Architecture Explained


The Reality:
You have a really big pile of data to deal with.


One traditional digger might not be enough.



You could get a really big, expensive digger...



or several smaller ones and get the job done faster & cheaper.

But what if the one big one suffers a mechanical failure?



With small diggers, failure of one does not affect the rest.

Which one do you choose?

vs


CloverETL Cluster resiliency features


Optimizing for robustness...


Fault Resiliency: HW & SW

Diagram (Before / After): automatic fail-over between Node 1 and Node 2.

Load Balancing
Diagram (Before / After): a new task is automatically load-balanced between Node 1 and Node 2.

CloverETL Cluster - BIG DATA features


Optimizing for speed...


Traditionally, data transformations were run on a single, big server with multiple CPUs and plenty of RAM. And it was expensive.


Then the CloverETL team developed the concept of a data transformation cluster. The CloverETL Cluster was born.

It creates a powerful data transformation beast from a set of low-cost commodity hardware machines.

Now, one data transformation can be set to run in parallel on all available nodes of the CloverETL Cluster.

Each cluster node executing the transformation is automatically fed with a different portion of the input data.


Before = Now: working in parallel (Part 1, Part 2, Part 3), the nodes finish the job faster, with fewer resources needed individually.

That sounds nice and simple. But how is it really done?

Diagram: transformation components allocated to cluster nodes; two components run 1x, one runs 3x.

CloverETL allows certain transformation components to be assigned to multiple cluster nodes.

Such components then run in multiple instances.
We call this Allocation.
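
A toy sketch of the idea (plain Python, not CloverETL graph syntax; the component and node names are made up for illustration): a component allocated to several nodes runs once per assigned node.

```python
# Toy model of Allocation (not CloverETL syntax): a component is assigned
# to one or more cluster nodes and runs once per assigned node.
allocation = {
    "ReadInput":   ["node1"],                    # runs 1x
    "Aggregate":   ["node1", "node2", "node3"],  # runs 3x
    "WriteOutput": ["node1"],                    # runs 1x
}

for component, nodes in allocation.items():
    for instance, node in enumerate(nodes, start=1):
        print(f"{component}: instance {instance} of {len(nodes)} runs on {node}")
```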

CloverETL Cluster

Special components allow incoming data to be split and sent in parallel flows to multiple nodes, where the processing flow continues.

Diagram: serial data on one node is partitioned and distributed to the 1st, 2nd and 3rd instances running on separate nodes.

Other components gather data from parallel flows back into a single, serial one.

Diagram: partitioned data from the 1st, 2nd and 3rd instances on separate nodes is merged back into serial data on one node.

The original transformation is automatically rewritten into several smaller ones, which are executed by cluster nodes in parallel. Which nodes will be used is determined by Allocation.

Let's take a look at an example.

In this example, we'll read data about company addresses. There are 10,499,849 records in total.

serial processing

We get a total of 51 records: one record per US state.


We also calculate statistics on the number of companies residing in each US state.

Here, we're processing the same input data, but in parallel now.

Split: each of the 3 parallel streams gets a portion of the input data.
Gather: the partial results from the streams are brought back together.
We get a total of 51 records again.


Go parallel in 1 minute.

Diagram: drag & drop a few components to turn the serial transformation into the parallel one.

What's the Trick?

Split the input data into parallel streams.
Do the heavy lifting on smaller data portions in parallel.
Bring the individual pieces of the results together at the end.

DONE
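
A minimal Python sketch of the same split / parallel work / gather pattern, using a local process pool in place of cluster nodes; the per-state company count stands in for the heavy lifting, and the record layout is an assumption for illustration.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_by_state(records):
    """Heavy lifting on one portion: count companies per US state."""
    return Counter(rec["state"] for rec in records)

def split(records, n):
    """Split the input into n roughly equal portions (round-robin)."""
    return [records[i::n] for i in range(n)]

def run_parallel(records, n=3):
    portions = split(records, n)                               # 1. split into parallel streams
    with ProcessPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(count_by_state, portions))    # 2. heavy lifting in parallel
    total = Counter()
    for partial in partials:                                    # 3. gather the partial results
        total += partial
    return total

if __name__ == "__main__":
    data = [{"state": s} for s in ["NY", "CA", "NY", "TX"]]
    print(run_parallel(data))   # NY: 2, CA: 1, TX: 1
```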

Let's continue: more on Allocation and partitioned sandboxes.


A Sandbox
We assume you are familiar with the CloverETL Server's concept of a SANDBOX.
A SANDBOX is a logical name for a file directory structure managed by the Server. It allows individual projects on the Server to be separated into logical units. Each CloverETL data transformation can access multiple sandboxes, either locally or remotely.

Let's look at a special type of sandbox: the partitioned sandbox.


In a partitioned sandbox, the input file is split into subfiles, each residing on a different node of the Cluster in a similarly structured folder.

Diagram: the partitioned sandbox SboxP spans Node 1, Node 2 and Node 3, each holding one part of the file (Part 1, Part 2, Part 3).

The sandbox presents the parts as the original, combined data.


Partitioned Sandboxes
A partitioned sandbox is a logical abstraction on top of similarly structured folders on different Cluster nodes.
The sandbox's physical structure lists the locations/nodes of the file portions.
The sandbox's logical structure gives a unified view of folders & files.
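
A toy Python sketch of that abstraction (hypothetical file names, not the Server's actual mechanism): identically structured per-node folders are presented as one logical sandbox.

```python
# Hypothetical layout of a partitioned sandbox "SboxP": the same folder
# structure exists on every node, each folder holding one part of the file.
physical = {
    "node1": ["SboxP/data/companies.part1.txt"],
    "node2": ["SboxP/data/companies.part2.txt"],
    "node3": ["SboxP/data/companies.part3.txt"],
}

def logical_view(physical):
    """Unified view: one logical file name mapped to its per-node parts."""
    view = {}
    for node, paths in physical.items():
        for path in paths:
            logical_name = path.rsplit(".part", 1)[0] + ".txt"
            view.setdefault(logical_name, []).append((node, path))
    return view

print(logical_view(physical))
# {'SboxP/data/companies.txt': [('node1', '...part1.txt'), ('node2', '...part2.txt'), ...]}
```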


Data processing happens where data resides.

Allocation defines how a transformation's run is distributed across nodes of the CloverETL Cluster.
A partitioned sandbox defines how data is partitioned across nodes of the CloverETL Cluster.

The allocation can be set to derive from the sandbox layout: we tell the cluster to run our transformation components on the nodes that also contain the portions of data we want to process.

Allocation Determined by a Partitioned Sandbox:
4 partitions, hence 4 parallel transformations.
There's no gathering at the end: partitioned results are stored directly to the partitioned sandbox. Allocation for the aggregator is derived from the sandbox being used.


Allocation Determined by an Explicit Number:
8 parallel transformations.
Partitioning at the beginning and gathering at the end is necessary, as we need to cross the serial/parallel boundary twice.


A Data Skew
This is called a data skew: data is not uniformly distributed across partitions. It indicates that the chosen partitioning key is not the best one for maximum performance. However, the chosen key allows us to perform a single-pass aggregation (no semi-results), so it's a good tradeoff.

The busiest worker has to process 2.5 million rows, whereas the least busy only 0.67 million, that is, roughly 3.7x less.


Aggregating, Sorting, Joining
Parallel Pitfalls

When processing data in parallel, a few things should be considered. Working in parallel means producing partial (semi-) results.

Diagram: record streams 1-4 each produce a semi-result; semi-results 1-4 are then combined into the final result.

First, we produce 4 aggregated semi-results. Then we aggregate the semi-results to get the final result. These partial results have to be further processed to get the final result.

The good news: when increasing or changing the number of parallel streams, we don't have to change the transformation.


Aggregating, Sorting, Joining
Parallel Pitfalls

Full transformation: parallel aggregation, then post-processing of the semi-results. The parallel step aggregates with count(), the post-processing step with sum().

Why?

Step 1
Example: parallel counting of occurrences of companies per state using count().
In step 1, we produce partial results. Because records are partitioned round-robin, data for one state may appear in multiple parallel streams. For example, we might get data for NY as 4 partial results in 4 different streams.

Step 2
In step 2, we merge all the partial results from the 4 parallel streams into one sequence and then aggregate again to get the final numbers. At this step the aggregation function is sum(): we sum the partial counts.
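
A small Python sketch of the two-step aggregation described above: count() within each parallel stream produces partial results, then sum() over the partial counts gives the final numbers. The 3-stream round-robin split and the sample data are illustrative assumptions.

```python
from collections import Counter

states = ["NY", "CA", "NY", "TX", "NY", "CA", "NY", "TX"]

# Round-robin partitioning into 3 parallel streams
streams = [states[i::3] for i in range(3)]

# Step 1: count() in each parallel stream -> partial results;
# data for one state (e.g. NY) may appear in several streams
partial_counts = [Counter(stream) for stream in streams]

# Step 2: merge the partial results and sum() the partial counts
final = Counter()
for partial in partial_counts:
    final += partial

print(partial_counts)   # NY counted separately in several streams
print(final)            # NY: 4, CA: 2, TX: 2 in total
```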

Aggregating, Sorting, Joining
Parallel Pitfalls

Parallel sorting: sort in the parallel streams, then merge when bringing them back together.

Why?

Sorting in parallel: records are sorted in the individual parallel streams, but not across all streams.
Bringing the parallel sorted streams together into a serial stream: records have to be merged according to the same key as used in the parallel sorting, to produce an overall sorted serial result.
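
A minimal Python sketch of the same idea: each stream is sorted independently, then the already-sorted streams are merged on the same key into one overall sorted serial stream (the sample data is illustrative).

```python
import heapq

streams = [
    ["NY", "AK", "OR"],
    ["DE", "VA", "IL"],
    ["AZ", "PA", "MD"],
]

# Sort each parallel stream independently (sorted within, not across, streams)
sorted_streams = [sorted(s) for s in streams]

# Merge the sorted streams on the same key to get the overall sorted result
overall = list(heapq.merge(*sorted_streams))
print(overall)
# ['AK', 'AZ', 'DE', 'IL', 'MD', 'NY', 'OR', 'PA', 'VA']
```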

Aggregating, Sorting, Joining
Parallel Pitfalls

Parallel joining

Why?

Joining in parallel: master & slave(s) records must be partitioned by the same key/field, and that same key must be used for joining the records.
Otherwise, there is a danger that records from master & slave with the same key will not join, as they end up in different parallel streams. A joiner joins only within one stream, not across streams.
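
A small Python sketch of why the partitioning key matters: when master and slave records are partitioned by the same key (here, by hashing the state), matching records land in the same stream and every per-stream join succeeds. The helper and sample data are illustrative, not CloverETL components.

```python
def partition_by_key(records, key, n):
    """Send each record to the stream chosen by hashing its join key.
    Both sides use the same hash, so matching keys co-locate."""
    streams = [[] for _ in range(n)]
    for rec in records:
        streams[hash(rec[key]) % n].append(rec)
    return streams

masters = [{"state": s} for s in ["AK", "IL", "NY"]]
slaves = [{"state": s, "companies": c} for s, c in [("AK", 10), ("IL", 25), ("NY", 99)]]

n = 3
master_streams = partition_by_key(masters, "state", n)
slave_streams = partition_by_key(slaves, "state", n)

# Each joiner instance only sees its own stream; it joins within, not across, streams.
for m_stream, s_stream in zip(master_streams, slave_streams):
    lookup = {s["state"]: s for s in s_stream}
    for m in m_stream:
        if m["state"] in lookup:
            print(m["state"], "joined ->", lookup[m["state"]]["companies"])
```

Partition one side round-robin instead, and matching keys scatter across different streams, so the per-stream joins miss them, which is exactly the situation shown in the next two examples.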

Aggregating, Sorting, Joining
Parallel Pitfalls

Example: parallel joining, 3 parallel streams, partitioning by state.

Master streams:
stream 1 [AK AZ DE]
stream 2 [IL MD NY]
stream 3 [OR PA VA]

Slave streams:
stream 1 [AL AK AZ AR CA CO CT DC DE FL]
stream 2 [GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND]
stream 3 [OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY]

Result (all master records joined):
stream 1 [AK AZ DE]
stream 2 [IL MD NY]
stream 3 [OR PA VA]

Aggregating, Sorting, Joining
Parallel Pitfalls

Example: parallel joining, 3 parallel streams, partitioning round-robin.

Master streams:
stream 1 [AK IL OR]
stream 2 [AZ MD VA]
stream 3 [DE NY PA]

Slave streams:
stream 1 [AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY]
stream 2 [AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV]
stream 3 [AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI]

Result (only some master records joined):
stream 1 []
stream 2 []
stream 3 [DE NY]

Bringing it all together

CloverETL Cluster has built-in fault resiliency and load balancing.
BIG DATA problems are handled through the Cluster's scalability.
Existing transformations can be easily converted to parallel.
There's no magic: users have full control over what's happening.

Going parallel is easy! Try it out for yourself.


If you have any questions, check out:


www.cloveretl.com
forum.cloveretl.com
blog.cloveretl.com

