B DATA
HANDLING LARGE
The CloverETL Cluster Architecture Explained
The Reality:
You have a really big pile to deal with.
[Figure: a pile labeled "Really Big Data"]
or several smaller ones and get the job done faster & cheaper.
Wednesday, August 14, 13
With small diggers, failure of one does not affect the rest.
[Figure: automatic fail-over, before vs. after, with Node 1 and Node 2]
[Figure: load balancing, before and after, with a new task arriving at Node 1]
The CloverETL
Cluster was born
[Figure: Part 1, Part 2, Part 3]
[Figure: Before = Now. Parts 1, 2 and 3 of the transformation; two parts run 1x and one part runs 3x, and each run is allocated to a Node.]
We call this
Allocation.
[Figure: CloverETL Cluster. Serial data becomes partitioned data, which is processed by the 1st, 2nd and 3rd instance on separate Nodes, and then becomes serial data again.]
serial processing
Here, we're processing the same input data, but in parallel now.
Split: work in 3 parallel streams.
Partial results.
Gather: we get a total of 51 records again.
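The split, parallel work and gather steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not CloverETL code: the function names, the round-robin split and the uppercase "transformation" are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(records, n):
    """Deal the records into n streams, one record at a time."""
    streams = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        streams[i % n].append(rec)
    return streams


def process(stream):
    """Hypothetical stand-in for the transformation run on each stream."""
    return [rec.upper() for rec in stream]


def gather(partials):
    """Concatenate the partial results back into one serial stream."""
    return [rec for part in partials for rec in part]


records = [f"record-{i}" for i in range(51)]      # 51 input records
streams = split_round_robin(records, 3)           # split
with ThreadPoolExecutor(max_workers=3) as pool:   # work in 3 parallel streams
    partials = list(pool.map(process, streams))   # partial results
result = gather(partials)                         # gather

print([len(s) for s in streams], len(result))     # [17, 17, 17] 51
```

No record is lost or duplicated on the way: the three streams hold 17 records each and the gathered output has 51 again.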
Go parallel in 1 minute.
[Figure: a serial graph is turned into a parallel one with two drag & drop steps]
DONE
Let's continue.
More on allocation and partitioned sandboxes
A Sandbox
We assume you are familiar with the CloverETL Server's concept of a SANDBOX.
A SANDBOX is a logical name for a file directory structure managed by the Server. It allows individual projects on the Server to be separated into logical units. Each CloverETL data transformation can access multiple sandboxes, either locally or remotely.
Partitioned Sandboxes
[Figure: the partitioned sandbox SboxP spans Node 1, Node 2 and Node 3 and holds Part 1, Part 2 and Part 3]
A partitioned sandbox is a
logical abstraction on top of
similarly structured folders
on different Cluster nodes.
The Sandbox's physical structure, listing the locations/nodes of the file portions
The Sandbox's logical structure, with a unified view of folders & files
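To make the two views concrete, here is a minimal Python sketch. The node names, file paths and both helper functions are hypothetical; the sketch only models the idea of one logical tree on top of per-node folders and does not reflect how the Server stores this information.

```python
# Hypothetical physical layout: which node holds a portion of which file.
physical = {
    "node1": ["data-in/customers.csv", "data-out/report.csv"],
    "node2": ["data-in/customers.csv", "data-out/report.csv"],
    "node3": ["data-in/customers.csv"],
}


def logical_view(layout):
    """Unified view: every path appears once, whichever nodes hold it."""
    return sorted({path for paths in layout.values() for path in paths})


def locations(layout, path):
    """Physical view of one file: the nodes that hold a portion of it."""
    return sorted(node for node, paths in layout.items() if path in paths)


print(logical_view(physical))
print(locations(physical, "data-out/report.csv"))
```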
Allocation
Allocation Determined By a Partitioned Sandbox:
4 partitions mean 4 parallel transformations.
There's no gathering at the end: partitioned results are stored directly to the partitioned sandbox. Allocation for the aggregator is derived from the sandbox being used.
Allocation Determined By an Explicit Number:
8 parallel transformations.
Partitioning at the beginning and gathering at the end are necessary, as we need to cross the serial/parallel boundary twice.
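The two ways of determining allocation can be summarized in a small Python sketch. The `Allocation` class, the `resolve_allocation` function and the `gather_at_end` flag are hypothetical names invented for this illustration; the flag simply mirrors the two examples above and is not a rule taken from the product.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Allocation:
    workers: int          # number of parallel transformations
    gather_at_end: bool   # whether results must be gathered back to serial


def resolve_allocation(sandbox_partitions: Optional[Sequence[str]] = None,
                       explicit_workers: Optional[int] = None) -> Allocation:
    """Derive the allocation from a partitioned sandbox or an explicit number."""
    if sandbox_partitions is not None:
        # One worker per partition; results stay in the partitioned sandbox.
        return Allocation(len(sandbox_partitions), gather_at_end=False)
    if explicit_workers is not None:
        # Serial in, serial out: partition at the start, gather at the end.
        return Allocation(explicit_workers, gather_at_end=True)
    raise ValueError("either a partitioned sandbox or a worker count is needed")


print(resolve_allocation(sandbox_partitions=["p1", "p2", "p3", "p4"]))
print(resolve_allocation(explicit_workers=8))
```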
A Data Skew
This is called a data skew.
Data is not uniformly distributed across partitions. This indicates that the chosen partitioning key is not the best one for maximum performance.
However, the chosen key lets us perform a single-pass aggregation (no semi-results), so it's a good tradeoff.
The busiest worker will have to process 2.5 million rows, whereas the least busy one processes only 0.67 million, that is, approximately 3.7x fewer.
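A quick way to see skew is to count how many rows each partition receives for a given key and compare the busiest partition with the least busy one. The sketch below assumes a CRC32 hash partitioner and a made-up key distribution; neither comes from the slides. The last line recomputes the ratio from the two row counts quoted above.

```python
import zlib
from collections import Counter


def partition_of(key, n):
    """Assumed partitioner: a stable hash of the key, modulo n partitions."""
    return zlib.crc32(key.encode("utf-8")) % n


def partition_loads(keys, n):
    """Number of rows that land in each of the n partitions."""
    counts = Counter(partition_of(k, n) for k in keys)
    return [counts[p] for p in range(n)]  # Counter yields 0 for empty ones


def skew_ratio(loads):
    """Busiest load divided by the least busy one (inf if one is empty)."""
    busiest, idlest = max(loads), min(loads)
    return float("inf") if idlest == 0 else busiest / idlest


# Hypothetical, deliberately uneven key distribution.
keys = ["NY"] * 500 + ["CA"] * 300 + ["TX"] * 150 + ["VT"] * 50
loads = partition_loads(keys, 4)
print(loads, skew_ratio(loads))

# Ratio for the two row counts quoted above.
print(round(skew_ratio([2_500_000, 670_000]), 2))  # 3.73
```

Because all rows with the same key go to the same partition, a dominant key always drags its whole load onto one worker, which is the price paid for single-pass aggregation.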
Parallel Pitfalls
[Figure: semi-result 1, semi-result 2, semi-result 3 and semi-result 4 are combined into semi-results 1,2,3,4 and then into the final result]
Parallel Pitfalls
count() here, sum() here. Why?
Step 1
Example: parallel counting of occurrences of companies per state using count().
In step 1, we produce partial results. Because records are partitioned round-robin, data for one state may appear in multiple parallel streams.
For example, we might get data for NY as 4 partial results in 4 different streams.
Step 2
In step 2, we merge all the partial results from the 4 parallel streams into a sequence and then aggregate again to get the final numbers.
At this step, the aggregation function is sum(): we sum the partial counts.
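The two steps can be replayed on a tiny data set. The sketch below is plain Python with made-up company records, not CloverETL code; it also shows what goes wrong if step 2 uses count() again instead of sum().

```python
from collections import Counter


def round_robin(records, n):
    """Deal records into n streams regardless of their key."""
    streams = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        streams[i % n].append(rec)
    return streams


def count_per_state(stream):
    """Step 1: count() within one stream gives a partial result."""
    return Counter(state for state, _company in stream)


def sum_partials(partials):
    """Step 2: sum() the partial counts per state."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the counts
    return total


# Hypothetical (state, company) records.
records = [("NY", "Acme"), ("NY", "Globex"), ("NY", "Initech"), ("NY", "Hooli"),
           ("CA", "Umbrella"), ("CA", "Stark"), ("TX", "Wayne"), ("NY", "Wonka")]

partials = [count_per_state(s) for s in round_robin(records, 4)]
final = sum_partials(partials)

# Pitfall: counting again only counts how many partial results mention a state.
wrong = Counter(state for partial in partials for state in partial)

print(dict(final))  # {'NY': 5, 'CA': 2, 'TX': 1}
print(wrong["NY"])  # 4
```

NY shows up in all 4 streams, so a second count() reports 4 (the number of partial results), while summing the partial counts gives the correct 5.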
Parallel Pitfalls
Parallel sorting
sort here, merge here. Why?
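One way to see why the final step has to be a merge rather than a plain gather: each stream is sorted on its own, so simply concatenating the streams does not give a sorted result. The sketch below uses Python's `heapq.merge` as a stand-in for the merge step; the three input streams are made up for the example.

```python
import heapq


def parallel_sort(streams):
    """Sort each stream independently, then merge the sorted streams."""
    sorted_streams = [sorted(s) for s in streams]   # sort here (per stream)
    return list(heapq.merge(*sorted_streams))       # merge here (once)


# Hypothetical contents of three parallel streams.
streams = [["NY", "AZ", "PA"], ["IL", "VA"], ["DE", "OR", "MD"]]

merged = parallel_sort(streams)
concatenated = [x for s in streams for x in sorted(s)]  # gather without merge

print(merged)        # ['AZ', 'DE', 'IL', 'MD', 'NY', 'OR', 'PA', 'VA']
print(concatenated)  # ['AZ', 'NY', 'PA', 'IL', 'VA', 'DE', 'MD', 'OR']
```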
Parallel Pitfalls
Parallel joining
Why ?
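A common answer, given here as general background rather than as a statement from the slides: a join running in parallel only finds a match if both records with the same key end up in the same stream. The sketch below compares round-robin partitioning with partitioning by key; the CRC32 partitioner, the helper names and the sample records are all assumptions for the illustration.

```python
import zlib


def round_robin(records, n):
    """Deal records into n streams regardless of their key."""
    streams = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        streams[i % n].append(rec)
    return streams


def partition_by_key(records, n):
    """Send every record with the same key (first field) to the same stream."""
    streams = [[] for _ in range(n)]
    for rec in records:
        streams[zlib.crc32(rec[0].encode("utf-8")) % n].append(rec)
    return streams


def local_join(left, right):
    """Join two streams on their first field, within one worker."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(k, lv, rv) for k, lv in left for rv in lookup.get(k, [])]


def parallel_join(left, right, n, partition):
    """Partition both inputs the same way and join stream by stream."""
    pairs = zip(partition(left, n), partition(right, n))
    return [row for l, r in pairs for row in local_join(l, r)]


# Hypothetical inputs.
companies = [("NY", "Acme"), ("CA", "Initech"), ("NY", "Globex"), ("TX", "Hooli")]
states = [("NY", "New York"), ("CA", "California"), ("TX", "Texas")]

print(len(parallel_join(companies, states, 3, round_robin)))       # 2
print(len(parallel_join(companies, states, 3, partition_by_key)))  # 4
```

With round-robin, two of the four matches are lost because the matching records sit in different streams; partitioning both inputs by the join key keeps them together.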
Parallel Pitfalls
Example
stream 1 [AL AK AZ AR CA CO CT DC DE FL]
stream 2 [GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND]
stream 3 [OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY]
Result
stream 1 [AZ DE]
stream 2 [IL MD NY]
stream 3 [OR PA VA]
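The example above can be replayed with a simple lookup-table partitioner. The three stream definitions are copied from the slide; the input records are an assumption, inferred from the states that appear in the Result, and the `route` function is a hypothetical helper.

```python
# Stream definitions as listed in the example.
STREAMS = {
    1: "AL AK AZ AR CA CO CT DC DE FL".split(),
    2: ("GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT "
        "NE NV NH NJ NM NY NC ND").split(),
    3: "OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY".split(),
}

# Reverse lookup: state code -> stream number.
STREAM_OF = {state: sid for sid, states in STREAMS.items() for state in states}


def route(records):
    """Send each record to the stream that owns its state."""
    out = {sid: [] for sid in STREAMS}
    for state in records:
        out[STREAM_OF[state]].append(state)
    return out


# Assumed input, inferred from the Result in the example.
records = ["AZ", "DE", "IL", "MD", "NY", "OR", "PA", "VA"]
print(route(records))
# {1: ['AZ', 'DE'], 2: ['IL', 'MD', 'NY'], 3: ['OR', 'PA', 'VA']}
```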
Parallel Pitfalls
Example
stream 1 [AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY]
stream 2 [AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV]
stream 3 [AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI]
Result
NY]