Table of contents
Foundation
Hardware architecture
Pipelining
Serial performance
Classification of parallel hardware
Moore's law
In 1965 Gordon Moore discovered a trend in the number
of transistors per chip:
a doubling every second year.
This observation has become known as
Moore's law.
Trend in # of transistors
Year 1971: 1 000 (10^3) transistors per core
Year 2011: 1 000 million (10^9) transistors per core
= six orders of magnitude over 40 years, i.e. a factor of 1.995 every 2 years
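As a check of the quoted factor (the arithmetic is not spelled out on the slide): six orders of magnitude over 40 years correspond to a biennial growth factor of
\[
\left(10^{9}/10^{3}\right)^{2/40} = 10^{0.3} \approx 1.995 .
\]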
Performance growth
So why worry about parallel computing?
If you need a performance improvement,
spare yourself the time of parallelising your problem:
wait two years, spend the time fishing (or something else)
and, thanks to increasing clock rates & Moore's law,
run on a faster computer.
Increasing clock rate?
well...
Clock rate growth halted around 2005
Performance Barriers
The free lunch is over!
Several barriers to growing performance via clock rate & transistor count:
Memory wall
Growing disparity between CPU speed and the speed of memory
outside the CPU chip. The reason is the limited
communication bandwidth across chip boundaries.
Power wall
Clock frequency cannot be increased without
exceeding the limits of air cooling.
Instruction-level parallelism (ILP)
Limits to the available low-level parallelism.
ILP is a measure of how many of the operations in a computer
program can be performed simultaneously.
Power wall
Memory wall
Problem of getting data to the CPU fast enough
Limiting factors
Bandwidth defines the maximum throughput of a
communication path in a digital system.
Measured in Hz.
Latency describes the delay between requesting
data and receiving it from RAM.
Storage density is the physical space occupied by memory.
Storage density
Why problematic?
Data cannot reach the CPU fast enough.
Light travels 0.3 mm in 10^-12 s (1 ps).
Memory would have to be located within a radius of 0.3 mm around
the CPU to fetch all data in time.
Example: weather forecast (first lecture)
Simulation stores 20 numbers per grid point
(temperature, velocity, pressure, k, ...)
32 bit per number, 10^10 grid points: 6.4 · 10^12 bit
Storage density: about 1 bit per atom needed for a suitable memory
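A quick check of the memory requirement quoted above (the multiplication is not written out on the slide):
\[
20 \times 32\,\text{bit} \times 10^{10} = 6.4 \times 10^{12}\,\text{bit} = 0.8 \times 10^{12}\,\text{byte} \approx 800\,\text{GB}.
\]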
Circumvention for
Storage density problem
Memory wall
Power wall
ILP wall
Sequential computing
Classical hardware with
undivided memory that stores program & data
a central processing unit (CPU) that executes the
instructions, operating on the data
Von Neumann architecture II
Reference model for a sequential computer
Instructions & Data are located in one memory
First universal programmable computer
(before: punch cards/hardware)
Harvard architecture
(Parallel) Harvard architecture
Reference model for fast CPUs (Pipelining)
Instructions & data located in different memories
Advantages
Instructions/data can be handled simultaneously
separate data & instruction buses
RISC (Reduced Instruction Set Computer)
avoids complex instruction sets that combine memory
access (slow) with arithmetic operations (fast)
Harvard architecture II
Handles an instruction & a datum in 1 cycle (in parallel),
whereas the von Neumann architecture reads serially (2 cycles)
Vector addition
#define N 1000000
unsigned long i;
double a[N];
double sum = 0.0;
...
for (i = 0; i < N; ++i)
    sum = sum + a[i];
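The slide shows only this fragment; a minimal self-contained version (the array is filled with the value 1.0 here purely as an assumption, so that the sketch is runnable) could look like this:

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];        /* static: keep the large array off the stack */
    double sum = 0.0;
    unsigned long i;

    for (i = 0; i < N; ++i)    /* assumed initialisation, not part of the slide */
        a[i] = 1.0;

    for (i = 0; i < N; ++i)    /* vector summation as on the slide */
        sum = sum + a[i];

    printf("sum = %f\n", sum);
    return 0;
}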
Performance comparison
Linpack benchmark, based on BLAS (Basic Linear Algebra Subroutines)
(Figure: measured performance plotted against matrix size / number of variables)
Pipelining
Decomposition of tasks into several steps
Model: Assembly line car production
(Figure: sequential execution vs. pipelined execution of the instruction
stages, e.g. ID = Instruction Decode, EX = Execution)
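The gain is not quantified on the slide; the standard count is that with k one-cycle pipeline stages, N instructions take
\[
T_{\text{sequential}} = N \cdot k \quad\text{cycles vs.}\quad T_{\text{pipelined}} = k + (N - 1) \quad\text{cycles},
\]
i.e. an asymptotic speedup of k for large N.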
Branch prediction
The hardware predicts the most probable branch and
computes its instructions speculatively.
Inlining
The pre-processor substitutes the function call
#define mult(a,b) ((a)*(b))
...
for (i = 0; i < n; i++) x[i] = mult(x[i], 1 + 1./2.);
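After macro expansion (written out here for illustration; the expanded form is not shown on the slide) the loop body contains no function call at all:

for (i = 0; i < n; i++) x[i] = ((x[i]) * (1 + 1./2.));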
Loop unrolling
Unrolling the loop: increasing the loop increment
for (i = 0; i < n; i += 4) {
    x[i]     = a * x[i];
    x[i + 1] = a * x[i + 1];
    x[i + 2] = a * x[i + 2];
    x[i + 3] = a * x[i + 3];
}
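The unrolled loop assumes that n is a multiple of 4; a common completion (sketched here, not taken from the slide) adds a cleanup loop for the remaining elements:

/* main unrolled loop: 4 elements per iteration */
for (i = 0; i + 3 < n; i += 4) {
    x[i]     = a * x[i];
    x[i + 1] = a * x[i + 1];
    x[i + 2] = a * x[i + 2];
    x[i + 3] = a * x[i + 3];
}
/* cleanup loop: the remaining n % 4 elements */
for (; i < n; i++)
    x[i] = a * x[i];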
Serial performance
CPU performance: operations per time unit
FLoating point OPerations per Second (FLOPS)
Computed from
number of pipelines
operations per pipeline
mean time unit per pipeline stage
Of interest
peak performance (maximum performance)
mean performance (Linpack test)
comparison with memory access time
Peak performance
Processors
Number of pipelines: #ALUs (Arithmetic Logical Units)
Operations per pipeline: #OPsPerALU (generally 2 operations at the same time)
Mean time unit per pipeline step: 1 / clock rate;
for a 2.4 GHz CPU: 1 / (2.4 · 10^9 s^-1) ≈ 417 ps
Peak performance
R_peak = (#ALUs · #OPsPerALU) / mean time unit = (2 · 2) / 417 ps ≈ 9.6 GFLOPS
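A tiny C sketch of the same calculation (the variable names are mine, not from the slide):

#include <stdio.h>

int main(void)
{
    double clock_rate  = 2.4e9;  /* Hz */
    int    n_alus      = 2;      /* number of pipelines / ALUs */
    int    ops_per_alu = 2;      /* operations per ALU and cycle */

    double mean_time = 1.0 / clock_rate;                   /* ~417 ps */
    double r_peak    = n_alus * ops_per_alu / mean_time;   /* in FLOPS */

    printf("mean time unit: %.0f ps\n", mean_time * 1e12);
    printf("R_peak: %.1f GFLOPS\n", r_peak / 1e9);
    return 0;
}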
Memory latency time
Refreshing: R_peak = 9.6 GFLOPS, mean time unit ≈ 417 ps
That means
reading/fetching a datum takes about 230 CPU clock cycles (230 · 417 ps),
or about 920 operations could be executed during this time
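A quick check of the 920-operation figure (the arithmetic is not written out on the slide): with 2 ALUs performing 2 operations each per cycle, the 230 cycles of one memory access allow
\[
230 \cdot 2 \cdot 2 = 920 \ \text{operations}.
\]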
Classification of parallel hardware

                Single Instruction   Multiple Instruction
Single Data     SISD                 MISD
Multiple Data   SIMD                 MIMD
Communication on DMMs (distributed-memory machines)
Local memory is private
Message exchange/communication via network
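The slides show no code for this; as an illustration, a minimal sketch of such a message exchange using MPI (one common library for distributed-memory communication), in which process 0 sends a single value to process 1:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42.0;   /* data lives in process 0's private local memory */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}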