Yosr SLAMA
Elmanar University, Tunisia
yosr.slama@gmail.com
ABSTRACT
Multicore machines are becoming more and more common, and ideally all applications should benefit from these advances in computer architecture. A central challenge in parallel computing is balancing the load across cores so as to minimize the overall execution time, or makespan, of the parallel program. Since multicores may differ in architecture, an effective mapping should tolerate this unknown variation to avoid degrading the makespan. Indeed, a mapping or static load-balancing method may no longer be effective when the state of the target machine changes during program execution. In this context, we propose a predictive approach based on chunking iterations at runtime, which lets the parallel code adapt to the processors' performance. From a parallel program, we select a set of loop-nest iterations, forming what we call a chunk, and execute it with a first mapping that assumes homogeneous cores. A performance assessment then corrects the mapping by speculating on the future state of the cores. The new mapping is applied to the next chunk for further evaluation and prediction. The process stops when the program has been fully executed or when piecewise execution is judged to be no longer effective.
In this paper, the state of the art of multicore platforms and of speculative techniques is first presented. Our performance-prediction process is then detailed. Finally, the contribution is validated through several tests and compared with existing static and dynamic approaches.
KEYWORDS
1 Introduction
Several tools have been proposed in the literature for the parallelization of loop-based programs, some of which rely on a mathematical formalism called the polyhedral model (PM). When targeting homogeneous machines, approaches based on the PM have proven effective. However, when targeting heterogeneous architectures, they become less so: the parallel program generated by the polyhedral model causes a load imbalance between the processors, since it results from a methodology that assumes equal computing power. Such parallelization therefore degrades, in particular, the parallel execution time, or makespan. Another problem is the growing communication overhead due to the time spent waiting for processor synchronization. This waiting blocks the parallel system and reduces the effectiveness of the parallel solution; waiting for the slowest processor can even lead to starvation among communicating processors. The heterogeneity of platforms is a major challenge for developers because, on multicore machines, the programmer has no way of knowing how core performance will vary at execution time. In this work, we are particularly interested in adapting, at runtime, the parallel code to the processors' actual performance.
2 STATE OF THE ART
The following figure shows the two approaches.

[Figure: comparison of the two approaches over processors P1 to P4]
Input: {N, chunk_size, p}
Initialisation:
    coef1 = coef2 = ... = coefp = 1/p;
    For each processor Pi
        Ni <- chunk_size * coefi;
    spec <- 0;
// Begin of parallel execution
For each speculation step spec do
    // Execution of a chunk
    Each core Pi executes its Ni iterations;
    Compute the parallel time Ti of each core Pi;
    coefi <- (Ni / Ti) / sum_{j=1..p} (Nj / Tj);
    Ni <- chunk_size * coefi;
    // End of chunk execution
    spec <- spec + 1;
// End of parallel execution (end of all chunks execution)
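As a concrete illustration, below is a minimal C/OpenMP sketch of this speculation loop. It is a sketch under stated assumptions, not the paper's implementation: the loop body do_iteration, the chunk size, and the fixed number of speculation steps are illustrative, and the coefficient update normalizes each core's observed speed Ni/Ti, our reconstruction of the fraction whose denominator is garbled in the source.

#include <stdio.h>
#include <omp.h>

#define P 4   /* number of cores, as in the experiments */

/* Placeholder for the body of the parallelized loop nest (assumption). */
static void do_iteration(long k) {
    volatile double x = (double)k * 0.5;
    (void)x;
}

/* One speculation step: each core runs its share of the chunk,
 * is timed, and the coefficients are updated speed-proportionally. */
static void speculate(long chunk_size, double coef[P]) {
    long   Ni[P];
    double Ti[P];

    for (int i = 0; i < P; i++)
        Ni[i] = (long)(chunk_size * coef[i]);

    #pragma omp parallel num_threads(P)
    {
        int i = omp_get_thread_num();
        double t0 = omp_get_wtime();
        for (long k = 0; k < Ni[i]; k++)   /* execute this core's share */
            do_iteration(k);
        Ti[i] = omp_get_wtime() - t0;      /* parallel time Ti of Pi */
    }

    /* coefi <- (Ni/Ti) / sum_j (Nj/Tj): faster cores get larger shares. */
    double total_speed = 0.0;
    for (int i = 0; i < P; i++)
        total_speed += (double)Ni[i] / Ti[i];
    for (int i = 0; i < P; i++)
        coef[i] = ((double)Ni[i] / Ti[i]) / total_speed;
}

int main(void) {
    double coef[P] = {0.25, 0.25, 0.25, 0.25};  /* homogeneous start: 1/p */
    for (int spec = 0; spec < 3; spec++) {
        speculate(300000, coef);
        printf("spec %d: coef = %.3f %.3f %.3f %.3f\n",
               spec, coef[0], coef[1], coef[2], coef[3]);
    }
    return 0;
}

On a machine where cores differ in speed, the printed coefficients drift away from 1/p after the first chunk, so the next chunk assigns fewer iterations to the slower cores.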
4 EXPERIMENTAL STRATEGY
4.1 Experimental strategy
Our approach has been implemented and evaluated using the MMC (Matrices Multiplication Code).
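To make the benchmark concrete, here is a minimal C sketch of the kernel a core executes for its share of a chunk, assuming the loop nest is chunked over rows of the result matrix; the function name mmc_chunk and the row-block decomposition are our assumptions, not details given in the paper.

/* Compute rows [row0, row0 + rows) of C = A * B for N x N matrices.
 * The Ni iterations assigned to core Pi correspond to one such row block. */
void mmc_chunk(int N, long row0, long rows,
               const double *A, const double *B, double *C)
{
    for (long i = row0; i < row0 + rows; i++)
        for (long j = 0; j < N; j++) {
            double s = 0.0;
            for (long k = 0; k < N; k++)
                s += A[i * N + k] * B[k * N + j];
            C[i * N + j] = s;
        }
}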
For configuration C2 (N=2000; chunk_size=300), processor P4 is initially the slowest (spec = 1, cf. Figure 7). After three speculations, our approach reduces the difference between the cores' parallel times.
To sum up:
- Our approach reduces the differences between the cores' parallel times.
- Even for small matrix sizes, our approach adapts to speed variations.
- There is always a convergence towards load balancing.
- Starting with a small chunk leads to faster convergence to load balancing.
- Once balancing has been reached over successive chunks, it is better to use a large chunk for the remaining speculation steps (see the sketch after this list).
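A possible realisation of the last two observations is sketched below: keep the chunk small while the cores' times still disagree, and enlarge it once they agree within a tolerance. The doubling factor, the tolerance, and the function name are illustrative assumptions, not values from the paper.

/* Return the chunk size for the next speculation step.
 * Ti[0..p-1] are the parallel times measured on the last chunk. */
long next_chunk_size(long chunk_size, long max_chunk,
                     const double Ti[], int p, double tol)
{
    double tmin = Ti[0], tmax = Ti[0];
    for (int i = 1; i < p; i++) {
        if (Ti[i] < tmin) tmin = Ti[i];
        if (Ti[i] > tmax) tmax = Ti[i];
    }
    /* Relative spread between the slowest and the fastest core. */
    if ((tmax - tmin) / tmax < tol) {
        long grown = 2 * chunk_size;      /* balanced: enlarge the chunk */
        return grown < max_chunk ? grown : max_chunk;
    }
    return chunk_size;                    /* still unbalanced: keep it small */
}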
4.3 Chunk Size Impact on Makespan
Figure 8. Intra-chunk speculation load balancing on C3: N=5000; chunk_size=300

Parallel times per core (s):

C3: N=5000, chunk_size=300
                      P1        P2        P3        P4
First speculation     150,968   —         212,74    214,205
Second speculation    37,556    37,704    42,502    42,798

C2: N=5000, chunk_size=2500
First speculation     457,4     544,196   547,389   819,178
Second speculation    564,91    576,79    577,51    752,26
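As a back-of-the-envelope check using only the recoverable C3 values above: the relative spread between the fastest and slowest core drops from (214,205 − 150,968) / 214,205 ≈ 30% at the first speculation to (42,798 − 37,556) / 42,798 ≈ 12% at the second, which illustrates the convergence towards balancing noted earlier.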
[Table: makespan comparison, T (sec), per processor P1 to P4. Recoverable values: 1041,49; 1044,13; 1472,9; 1474,85; 1474,87; 0,000033; 2182,1; dynamic approach: 1828,19]
Conclusion
Speculation has proved to be a good way to parallelize code piecewise and to re-execute chunks when an error is detected. By studying the state of the art of parallelization methods targeting multicore