
HANDLING BIG DATA
The CloverETL Cluster Architecture Explained


The Reality:
You have a really big pile of data to deal with.


One traditional digger might not be enough.



You could get a really big, expensive digger...



or several smaller ones and get the job done faster & cheaper.

But what if the one big one suffers a mechanical failure?



With small diggers, failure of one does not affect the rest.

Which one do you choose?

vs


CloverETL Cluster resiliency features


Optimizing for robustness...


Fault Resiliency: HW & SW

Diagram (Before / After): automatic fail-over between Node 1 and Node 2.

Load Balancing
Diagram (Before / After): a new task is automatically load-balanced between Node 1 and Node 2.

CloverETL Cluster - BIG DATA features


Optimizing for speed...


Traditionally, data transformations were run on a single, big server with multiple CPUs and plenty of RAM. And it was expensive.


Then the CloverETL team developed the concept of a data transformation cluster. The CloverETL Cluster was born.

It creates a powerful data transformation beast from a set of low-cost commodity hardware machines.

Now, one data transformation can be set to run in parallel on all available nodes of the CloverETL Cluster.

Each cluster node executing the transformation is automatically fed with a different portion of the input data.


Before = Now: working in parallel (Part 1, Part 2, Part 3), the nodes finish the job faster, with fewer resources needed individually.

That sounds nice and simple. But how is it really done?

Diagram: transformation components allocated to cluster nodes; two components run 1x, one runs 3x.

CloverETL allows certain transformation components to be assigned to multiple cluster nodes.

Such components then run in multiple instances.
We call this Allocation.
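
A toy sketch of the idea (plain Python, not CloverETL graph syntax; the component and node names are made up for illustration): a component allocated to several nodes runs once per assigned node.

```python
# Toy model of Allocation (not CloverETL syntax): a component is assigned
# to one or more cluster nodes and runs once per assigned node.
allocation = {
    "ReadInput":   ["node1"],                    # runs 1x
    "Aggregate":   ["node1", "node2", "node3"],  # runs 3x
    "WriteOutput": ["node1"],                    # runs 1x
}

for component, nodes in allocation.items():
    for instance, node in enumerate(nodes, start=1):
        print(f"{component}: instance {instance} of {len(nodes)} runs on {node}")
```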

CloverETL Cluster

Special components allow incoming data to be split and sent in parallel flows to multiple nodes, where the processing flow continues.

Diagram: serial data on one node is partitioned and distributed to the 1st, 2nd and 3rd instances running on separate nodes.

Other components gather data from parallel flows back into a single, serial one.

Diagram: partitioned data from the 1st, 2nd and 3rd instances on separate nodes is merged back into serial data on one node.

The original transformation is automatically rewritten into several smaller ones, which are executed by cluster nodes in parallel. Which nodes will be used is determined by Allocation.

Let's take a look at an example.

In this example, we'll read data about company addresses. There are 10,499,849 records in total.

serial processing

We get a total of 51 records: one record per US state.


We also calculate statistics on the number of companies residing in each US state.

Here, we're processing the same input data, but in parallel now.

Split: each of the 3 parallel streams gets a portion of the input data.
Gather: the partial results from the streams are brought back together.
We get a total of 51 records again.


Go parallel in 1 minute.

Diagram: drag & drop a few components to turn the serial transformation into the parallel one.

What's the Trick?

Split the input data into parallel streams.
Do the heavy lifting on smaller data portions in parallel.
Bring the individual pieces of the results together at the end.

DONE
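
A minimal Python sketch of the same split / parallel work / gather pattern, using a local process pool in place of cluster nodes; the per-state company count stands in for the heavy lifting, and the record layout is an assumption for illustration.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_by_state(records):
    """Heavy lifting on one portion: count companies per US state."""
    return Counter(rec["state"] for rec in records)

def split(records, n):
    """Split the input into n roughly equal portions (round-robin)."""
    return [records[i::n] for i in range(n)]

def run_parallel(records, n=3):
    portions = split(records, n)                               # 1. split into parallel streams
    with ProcessPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(count_by_state, portions))    # 2. heavy lifting in parallel
    total = Counter()
    for partial in partials:                                    # 3. gather the partial results
        total += partial
    return total

if __name__ == "__main__":
    data = [{"state": s} for s in ["NY", "CA", "NY", "TX"]]
    print(run_parallel(data))   # NY: 2, CA: 1, TX: 1
```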

Let's continue: more on Allocation and partitioned sandboxes.


A Sandbox
We assume you are familiar with the CloverETL Server's concept of a SANDBOX.
A SANDBOX is a logical name for a file directory structure managed by the Server. It allows individual projects on the Server to be separated into logical units. Each CloverETL data transformation can access multiple sandboxes, either locally or remotely.

Let's look at a special type of sandbox: the partitioned sandbox.


In a partitioned sandbox, the input file is split into subfiles, each residing on a different node of the Cluster in a similarly structured folder.

Diagram: the partitioned sandbox SboxP spans Node 1, Node 2 and Node 3, each holding one part of the file (Part 1, Part 2, Part 3).

The sandbox presents the parts as the original, combined data.


Partitioned Sandboxes
A partitioned sandbox is a logical abstraction on top of similarly structured folders on different Cluster nodes.
The sandbox's physical structure lists the locations/nodes of the file portions.
The sandbox's logical structure gives a unified view of folders & files.
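
A toy Python sketch of that abstraction (hypothetical file names, not the Server's actual mechanism): identically structured per-node folders are presented as one logical sandbox.

```python
# Hypothetical layout of a partitioned sandbox "SboxP": the same folder
# structure exists on every node, each folder holding one part of the file.
physical = {
    "node1": ["SboxP/data/companies.part1.txt"],
    "node2": ["SboxP/data/companies.part2.txt"],
    "node3": ["SboxP/data/companies.part3.txt"],
}

def logical_view(physical):
    """Unified view: one logical file name mapped to its per-node parts."""
    view = {}
    for node, paths in physical.items():
        for path in paths:
            logical_name = path.rsplit(".part", 1)[0] + ".txt"
            view.setdefault(logical_name, []).append((node, path))
    return view

print(logical_view(physical))
# {'SboxP/data/companies.txt': [('node1', '...part1.txt'), ('node2', '...part2.txt'), ...]}
```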


Data processing happens where data resides.

Allocation defines how a transformation's run is distributed across nodes of the CloverETL Cluster.
A partitioned sandbox defines how data is partitioned across nodes of the CloverETL Cluster.

The allocation can be set to derive from the sandbox layout: we tell the cluster to run our transformation components on the nodes that also contain the portions of data we want to process.

Allocation Determined by a Partitioned Sandbox:
4 partitions, hence 4 parallel transformations.
There's no gathering at the end: partitioned results are stored directly to the partitioned sandbox. Allocation for the aggregator is derived from the sandbox being used.


Allocation Determined by an Explicit Number:
8 parallel transformations.
Partitioning at the beginning and gathering at the end is necessary, as we need to cross the serial/parallel boundary twice.


A Data Skew
This is called a data skew: data is not uniformly distributed across partitions. It indicates that the chosen partitioning key is not the best one for maximum performance. However, the chosen key allows us to perform a single-pass aggregation (no semi-results), so it's a good tradeoff.

The busiest worker has to process 2.5 million rows, whereas the least busy only 0.67 million, that is, roughly 3.7x less.


Aggregating, Sorting, Joining
Parallel Pitfalls

When processing data in parallel, a few things should be considered. Working in parallel means producing partial (semi-) results.

Diagram: record streams 1-4 each produce a semi-result; semi-results 1-4 are then combined into the final result.

First, we produce 4 aggregated semi-results. Then we aggregate the semi-results to get the final result. These partial results have to be further processed to get the final result.

The good news: when increasing or changing the number of parallel streams, we don't have to change the transformation.


Aggregating, Sorting, Joining
Parallel Pitfalls

Full transformation: parallel aggregation, then post-processing of the semi-results. The parallel step aggregates with count(), the post-processing step with sum().

Why?

Step 1
Example: parallel counting of occurrences of companies per state using count().
In step 1, we produce partial results. Because records are partitioned round-robin, data for one state may appear in multiple parallel streams. For example, we might get data for NY as 4 partial results in 4 different streams.

Step 2
In step 2, we merge all the partial results from the 4 parallel streams into one sequence and then aggregate again to get the final numbers. At this step the aggregation function is sum(): we sum the partial counts.
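
A small Python sketch of the two-step aggregation described above: count() within each parallel stream produces partial results, then sum() over the partial counts gives the final numbers. The 3-stream round-robin split and the sample data are illustrative assumptions.

```python
from collections import Counter

states = ["NY", "CA", "NY", "TX", "NY", "CA", "NY", "TX"]

# Round-robin partitioning into 3 parallel streams
streams = [states[i::3] for i in range(3)]

# Step 1: count() in each parallel stream -> partial results;
# data for one state (e.g. NY) may appear in several streams
partial_counts = [Counter(stream) for stream in streams]

# Step 2: merge the partial results and sum() the partial counts
final = Counter()
for partial in partial_counts:
    final += partial

print(partial_counts)   # NY counted separately in several streams
print(final)            # NY: 4, CA: 2, TX: 2 in total
```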

Aggregating, Sorting, Joining
Parallel Pitfalls

Parallel sorting: sort in the parallel streams, then merge when bringing them back together.

Why?

Sorting in parallel: records are sorted in the individual parallel streams, but not across all streams.
Bringing the parallel sorted streams together into a serial stream: records have to be merged according to the same key as used in the parallel sorting, to produce an overall sorted serial result.
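
A minimal Python sketch of the same idea: each stream is sorted independently, then the already-sorted streams are merged on the same key into one overall sorted serial stream (the sample data is illustrative).

```python
import heapq

streams = [
    ["NY", "AK", "OR"],
    ["DE", "VA", "IL"],
    ["AZ", "PA", "MD"],
]

# Sort each parallel stream independently (sorted within, not across, streams)
sorted_streams = [sorted(s) for s in streams]

# Merge the sorted streams on the same key to get the overall sorted result
overall = list(heapq.merge(*sorted_streams))
print(overall)
# ['AK', 'AZ', 'DE', 'IL', 'MD', 'NY', 'OR', 'PA', 'VA']
```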

Aggregating, Sorting, Joining
Parallel Pitfalls

Parallel joining

Why?

Joining in parallel: master & slave(s) records must be partitioned by the same key/field, and that same key must be used for joining the records.
Otherwise, there is a danger that records from master & slave with the same key will not join, as they end up in different parallel streams. A joiner joins only within one stream, not across streams.
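
A small Python sketch of why the partitioning key matters: when master and slave records are partitioned by the same key (here, by hashing the state), matching records land in the same stream and every per-stream join succeeds. The helper and sample data are illustrative, not CloverETL components.

```python
def partition_by_key(records, key, n):
    """Send each record to the stream chosen by hashing its join key.
    Both sides use the same hash, so matching keys co-locate."""
    streams = [[] for _ in range(n)]
    for rec in records:
        streams[hash(rec[key]) % n].append(rec)
    return streams

masters = [{"state": s} for s in ["AK", "IL", "NY"]]
slaves = [{"state": s, "companies": c} for s, c in [("AK", 10), ("IL", 25), ("NY", 99)]]

n = 3
master_streams = partition_by_key(masters, "state", n)
slave_streams = partition_by_key(slaves, "state", n)

# Each joiner instance only sees its own stream; it joins within, not across, streams.
for m_stream, s_stream in zip(master_streams, slave_streams):
    lookup = {s["state"]: s for s in s_stream}
    for m in m_stream:
        if m["state"] in lookup:
            print(m["state"], "joined ->", lookup[m["state"]]["companies"])
```

Partition one side round-robin instead, and matching keys scatter across different streams, so the per-stream joins miss them, which is exactly the situation shown in the next two examples.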

Aggregating, Sorting, Joining
Parallel Pitfalls

Example: parallel joining, 3 parallel streams, partitioning by state.

Master streams:
stream 1 [AK AZ DE]
stream 2 [IL MD NY]
stream 3 [OR PA VA]

Slave streams:
stream 1 [AL AK AZ AR CA CO CT DC DE FL]
stream 2 [GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND]
stream 3 [OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY]

Result (all master records joined):
stream 1 [AK AZ DE]
stream 2 [IL MD NY]
stream 3 [OR PA VA]

Aggregating, Sorting, Joining
Parallel Pitfalls

Example: parallel joining, 3 parallel streams, partitioning round-robin.

Master streams:
stream 1 [AK IL OR]
stream 2 [AZ MD VA]
stream 3 [DE NY PA]

Slave streams:
stream 1 [AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY]
stream 2 [AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV]
stream 3 [AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI]

Result (only some master records joined):
stream 1 []
stream 2 []
stream 3 [DE NY]

Bringing it all together

CloverETL Cluster has built-in fault resiliency and load balancing.
BIG DATA problems are handled through the Cluster's scalability.
Existing transformations can be easily converted to parallel.
There's no magic: users have full control over what's happening.

Going parallel is easy! Try it out for yourself.


If you have any questions, check out:


www.cloveretl.com
forum.cloveretl.com
blog.cloveretl.com

