
Parallel Computing I

Parallel Hardware Architecture


Thorsten Grahs, 11. May 2015

Table of contents
Foundation
Hardware architecture
Pipelining
Serial performance
Classification of parallel hardware


Moore's law
In 1965 Gordon Moore observed a trend in the number
of transistors per chip:
he found a doubling every second year.
This observation has become known as
Moore's law.

Trend in # of transistors
Year 1971:
1000 (10^3) transistors per core
Year 2011:
1000 million (10^9) transistors per core
⇒
six orders of magnitude over 40 years, i.e.
a factor of ≈ 1.995 every 2 years
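
A quick check of the factor quoted above, using only the two data points from this slide:

    \[ \frac{10^9}{10^3} = 10^6 \ \text{over 40 years}
       \;\Rightarrow\;
       \text{factor per 2 years} = \left(10^6\right)^{2/40} = 10^{0.3} \approx 1.995 \]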

Transistors per Processor over Time

Continues to grow exponentially (Moore's law)



Performance growth
So why worry about parallel computing?
If you need a performance improvement:
save yourself the time of parallelising your problem,
wait two years and spend the time fishing (or something else).
Thanks to the increasing clock rate & Moore's law, you can simply
run on a faster computer.
Increasing clock rate?
Well...
Clock rate growth halted around 2005


Processor Clock Rate over Time

Stopped growing exponentially (Power wall)



Performance Barriers
The free lunch is over!
Several barriers prevent further performance growth via clock rate & transistor count:
Memory wall
Growing disparity between the speed of the CPU and of the memory
outside the CPU chip. The reason is the limited
communication bandwidth across chip boundaries.
Power wall
The clock frequency cannot be increased without
exceeding the limits of air cooling.
Instruction-level parallelism (ILP) wall
Limits to the available low-level parallelism.
ILP is a measure of how many of the operations in a computer
program can be performed simultaneously.

Power wall


Memory wall
Problem: bringing the data to the CPU fast enough
Limiting factors
Bandwidth defines the maximum
throughput of a communication
path in a digital system
(measured in bit/s).
Latency describes the delay
between the request and the
receipt of data from RAM.
Storage density is the amount of memory
that fits into a given physical space.

Storage density
Why problematic?
Data can't get to the CPU fast enough:
light travels 0.3 mm in 10^-12 s,
so the memory would have to be located within a radius of 0.3 mm around
the CPU to fetch all data in time.
Example: weather forecast (first lecture)
The simulation stores 20 numbers per grid point
(temperature, velocity, pressure, k, ...)
32 bit per number, 10^10 grid points ⇒ 6.4 * 10^12 bit
⇒ storage density of ≈ 1 bit per atom for a suitable memory
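
The bit count follows directly from the stated assumptions:

    \[ 20\ \tfrac{\text{numbers}}{\text{point}} \times 32\ \tfrac{\text{bit}}{\text{number}} \times 10^{10}\ \text{points}
       = 6.4 \times 10^{12}\ \text{bit} \approx 0.8\ \text{TByte} \]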

Need for explicit parallel mechanisms


Conclusion
Requirements for performance scaling
explicit parallel mechanisms
explicit parallel programming
Outcome
Many-core CPUs
Parallel programming
Distributed memory computing
Cluster computer/GPGPUs

The answer to which question?


Why not use one big master mind?
"If you were ploughing a field, what would you rather use?
Two strong oxen or 1024 chickens?"
Seymour Cray (1925-1996)
Founder of Cray Inc.
Answer to Seymour Cray
(and the foundation of distributed memory computing)

Because this is inflexible and sluggish!


"To pull a bigger wagon, it is easier to add more oxen
than to grow a gigantic ox."
W. Gropp, E. Lusk, A. Skjellum in "Using MPI"

Example: Distributed memory machine


Building a cluster
(i.e. a distributed memory machine)
Idea
Use e.g. 1000 standard PCs with 10^9 FLOPS each
Domain decomposition:
each processor is responsible for 10^7 grid points
The memory-CPU distance may be up to 300 mm
Memory requirement per CPU: 800 MByte
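
These figures are consistent with the weather-forecast example above (10^10 grid points, 20 numbers of 32 bit each):

    \[ \frac{10^{10}\ \text{points}}{1000\ \text{PCs}} = 10^{7}\ \tfrac{\text{points}}{\text{PC}}, \qquad
       10^{7} \times 20 \times 4\ \text{Byte} = 800\ \text{MByte per CPU} \]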

Circumvention for
Storage density problem
Memory wall
Power wall
ILP wall

Sequential computing
Classical hardware with
undivided memory that
stores program & data
Central processing unit
(CPU) that executes the
instructions,
operating on the data

Performance: that's what it's all about!


GFLOPS = GHz? This holds for vector or superscalar processors,
but only in ideal situations (peak performance).
For practical applications, peak performance is hard to achieve.

von Neumann architecture


John von Neumann (1903-1957)
Hungarian-American mathematician
From 1933 at the Princeton
Institute for Advanced Study
Work related to mathematics, quantum
physics and computer science
Electronic Numerical Integrator And Computer (ENIAC)
program (Univ. of Pennsylvania) with J. Eckert & J. Mauchly
Electronic Discrete Variable Automatic Computer (EDVAC)
program for the US Army: ballistic calculations
"First Draft of a Report on the EDVAC", 1945

von Neumann architecture II
Reference model for a sequential computer
Instructions & Data are located in one memory
First universal programmable computer
(before: punch cards/hardware)


Harvard architecture
(Parallel) Harvard architecture
Reference model for fast CPUs (Pipelining)
Instructions & data located in different memories
Advantages
Instruction/data can be handled simultaneously
separated data & instruction bus
RISC (Reduced Instruction Set Computer)
avoids complex instruction sets that combine memory
access (slow) with arithmetic operations (fast)

Harvard architecture II
Handles instruction & datum in 1 cycle (in parallel),
whereas the von Neumann architecture reads them serially (2 cycles)


Example | Vector add


Task list for sequential computing

Vector addition
    #define N 1000000
    unsigned long i;
    double a[N];
    double sum = 0.0;
    ...
    for (i = 0; i < N; ++i)
        sum = sum + a[i];

Fetch the next instruction from
memory into a register
Decode/interpret the instruction
Load the first argument (sum) from
memory into a register
Load the 2nd argument (a[i]) from
memory into an additional register
Execute the instruction and store the
result in a third register
Write the result (sum) back to memory

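For reference, a minimal compilable version of the loop above (the initialisation and the printf are additions for illustration, not part of the slide):

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];          /* static: keep the large array off the stack */
        double sum = 0.0;
        unsigned long i;

        for (i = 0; i < N; ++i)      /* fill the vector with sample data */
            a[i] = 1.0;

        for (i = 0; i < N; ++i)      /* the loop from the slide */
            sum = sum + a[i];

        printf("sum = %f\n", sum);
        return 0;
    }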

Von Neumann bottleneck


Performance
The main barrier is the memory:
DRAM runs at e.g. 1066 MHz,
modern CPUs are ~100x faster, i.e. the memory
thwarts the computational power of a serial computer
Remedy
Fast memory buffers to avoid (re-)accessing
the slower main memory
Cache hierarchy: L1, L2, L3 cache
RISC & pipelining

Performance comparison
Figure: Linpack benchmark (BLAS, Basic Linear Algebra Subroutines):
performance versus matrix size (number of variables)

Complex Instruction Set Computers CISC


CISC (used before 1985)
Simpler programming with complex operations in one
instruction (software was expensive)
Bounded memory size favoured short programs
(complex instructions ⇒ compact programs)
Disadvantages
Complex instructions imply a slower clock rate
Pipelining is harder to implement
Longer development cycles
Error prone
Typically only ~30% of the instructions were used

Reduced Instruction Set Computers RISC


RISC (used since 1985)
Memory became cheaper and faster
Decoding and executing instructions became the limiting
factor
Advantages
Simpler software design
Easier to debug
Shorter development cycles
Cheaper to produce
Faster execution
More potential for optimization
(complex instructions have many interdependencies that hinder it)

Why is RISC faster?


CISC: short programs with complex instructions
RISC: long programs with simple instructions
Methods used in RISC computers
Instruction pipelines
parallel execution of instruction phases
(fetch, decode, execute, write)
Cached memory
small but fast memory buffers between
main memory and CPU
Superscalar execution
grouping of instructions for parallel execution

Pipelining
Decomposition of tasks into several steps
Model: Assembly line car production


Pipelining | Example: laundry I


Anna, Britta, Caspar & Daniel:
each of them has a load of dirty
laundry to wash, dry and fold
Washing takes 30 minutes
Drying takes 40 minutes
Folding takes 20 minutes


Pipelining | Example: laundry II

Sequential execution

(6 h for 4 loads of laundry)


Pipelining | Example: laundry III

Pipelining

(3.5 h for 4 loads of laundry)



Pipelining | Example: laundry IV


Pipelining increases
throughput (operations/time)
The throughput depends on the
longest operation step
Unbalanced step lengths
reduce the speed-up
Potential speed-up
= # of pipeline stages

Pipelining for laundry

(3.5 h for 4 loads of laundry)
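
A quick check of the two times from the laundry slides (30 min washing, 40 min drying, 20 min folding, 4 loads):

    \[ T_{\text{seq}}  = 4 \times (30+40+20)\ \text{min} = 360\ \text{min} = 6\ \text{h} \]
    \[ T_{\text{pipe}} = (30+40+20)\ \text{min} + 3 \times 40\ \text{min} = 210\ \text{min} = 3.5\ \text{h} \]

The 40-minute dryer is the longest stage, so it sets the rate at which finished loads leave the pipeline.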


Pipelining | Logical steps


Logical steps in a computer
IF  Instruction Fetch
ID  Instruction Decode
OF  Operand Fetch
EX  Execution
OS  Operand Store


Pipelining | Logical steps II


Logical instruction phases
IF  Instruction Fetch
ID  Instruction Decode
OF  Operand Fetch
EX  Execution
OS  Operand Store


Problems with pipelining | Data dependencies


Data dependencies
One instruction may depend on results of the previous
instruction:

    a = x + y;
    b = a * c;

The computation of b must wait for the end of the computation of a.
Resolving the data dependency:
the computation of a can be executed ahead of time

Out-of-order execution


The compiler (or the CPU) changes the execution order to resolve data
dependencies
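
A minimal sketch of such a reordering at the source level; the variables u, v and d are hypothetical, added only to have an independent instruction available. In practice the compiler or the out-of-order hardware performs this automatically:

    /* original order: the multiplication must wait for a */
    a = x + y;
    b = a * c;
    d = u + v;          /* independent of a and b */

    /* reordered: the independent addition fills the waiting cycles */
    a = x + y;
    d = u + v;          /* executes while a is still in flight */
    b = a * c;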

Problems with pipelining | Branches


Conditional control flow
The control flow/branch depends on former instructions:

    if (d > 2)
        { x = 0.; }
    else
        { x = 1.; }

Branch prediction
The hardware predicts the most probable branch and executes its
instructions speculatively; a misprediction forces a pipeline flush

Problems with pipelining | Inlining


Unconditional control flow
The control flow has to be interrupted for the function call:

    double mult(double a, double b) { return a * b; }
    ...
    for (i = 0; i < n; i++) x[i] = mult(x[i], 1. + 1./2.);

Inlining
The pre-processor substitutes the function call:

    #define mult(a,b) ((a)*(b))
    ...
    for (i = 0; i < n; i++) x[i] = mult(x[i], 1. + 1./2.);
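
As a side note (not on the slide): in modern C the same effect is usually obtained with an inline function instead of a macro, which keeps type checking:

    /* the compiler substitutes the call body, like the macro above */
    static inline double mult(double a, double b) { return a * b; }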

Problems with pipelining | Loop unrolling


Unconditional control flow
The control flow has to be interrupted at the end of every iteration:

    for (i = 0; i < n; i++)
        x[i] = a * x[i];

Loop unrolling
Unrolling the loop / increasing the loop-counter stride
(assuming n is a multiple of 4):

    for (i = 0; i < n; i += 4) {
        x[i]     = a * x[i];
        x[i + 1] = a * x[i + 1];
        x[i + 2] = a * x[i + 2];
        x[i + 3] = a * x[i + 3];
    }

Serial performance
CPU performance = operations per time unit,
measured in FLoating point OPerations per Second (FLOPS)
Computed from
the number of pipelines
the operations per pipeline
the mean time unit per pipeline phase

Interested in
Peak performance (Maximal performance)
Mean performance (LinPack test)
Comparison to memory access time

Peak Performance
Processor

Clock rate 2.4 GHz

Number of pipelines
= # ALUs (Arithmetic Logical Units)
Operations per pipeline
= #OPsPerALU (generally 2 operations at the same time)
Mean time unit per pipeline step
= 1 / clock rate; for a 2.4 GHz CPU: 1/(2.4 * 10^9) s ≈ 417 ps

Peak Performance

    R_peak = (#ALUs * #OPsPerALU) / (mean time unit)
           = (2 * 2) / 417 ps
           ≈ 9.6 GFLOPS

Storage clock rate vs. CPU clock rate


Time for fetching data from memory
DDR3-1600 DRAM (Double Data Rate Dynamic RAM)
Memory clock 400 MHz, bus clock 1600 MHz

Deliver the address via the bus to the memory
(≈ 1 clock of the bus)
Memory latency time
(≈ 4-8 clocks of the memory clock; 8 in the calculation below)
(time between address access and data fetch)
Refreshing the memory after reading/writing data
(≈ 20-40 clocks of the memory clock; 30 below)
Transporting the data from memory
(≈ 1 clock of the bus)

    (1 + 1)/(1600 * 10^6 /s) + 8/(400 * 10^6 /s) + 30/(400 * 10^6 /s) ≈ 96 ns
         bus time                memory latency time    refreshing


Memory clock rate vs. CPU clock rate


Computer
2.4 GHz CPU, DDR3-1600 RAM, 1600 MHz FSB / 400 MHz DRAM
Peak performance

R_peak = 9.6 GFLOPS

Clock cycle ≈ 417 ps


Operation (reading a datum)
≈ 96 ns

    96 ns / 417 ps ≈ 230

That means
reading/fetching a datum takes ≈ 230 CPU clock cycles,
i.e. ≈ 920 operations could be executed during this time

Classification of parallel architectures


Flynn's taxonomy
Michael J. Flynn (*1934), Professor Emeritus, Stanford Univ.
Classification of computer architectures by instruction and
data streams

                  Single Instruction    Multiple Instruction
  Single Data     SISD                   MISD
  Multiple Data   SIMD                   MIMD

SISD Single Instruction Single Data


Hardware
Sequential computer
Classical hardware with one ALU
Princeton architecture (v.Neumann)
Harvard architecture


SIMD Single Instruction Multiple Data


Hardware
Parallel system architecture
Modern CPUs with several ALUs
Data are stored in vector registers
Each instruction operates on an entire vector register
Further development of Harvard architecture
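
As an illustration (not from the slide): a loop that a compiler can map onto such vector registers automatically, e.g. with gcc -O3 (auto-vectorization); one SIMD instruction then processes several array elements at once:

    /* element-wise vector addition: a typical candidate for SIMD execution */
    void vec_add(double *c, const double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];   /* single instruction, multiple data */
    }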


SIMD Single Instruction Multiple Data


Hardware
Array or vector processor (SIMD with parallel registers)
Clock-synchronized execution of one
instruction on different data (e.g. addition of two vectors)
Example: Cray-1 (1976)
80 MFLOPS, vector registers with 64 elements
Weight: 5.5 tons
Los Alamos National Lab.
(atomic test computations)
European Centre for Medium-Range Weather Forecasts
(10-day weather forecasts)

SIMD Single Instruction Multiple Data


Example: Thinking Machines
Connection Machine
CM-1 (1985)
201 MFLOPS
65,536 1-bit CPUs,
each with its own 4-Kbit
memory
Clock rate 4 MHz
SIMD machine
CM-5 (1991): MIMD
architecture

MISD Multiple Instruction Single Data


Parallel or serial system architecture
Used for redundant calculations or to obtain different
solutions for one problem
Sometimes this class is considered empty
Examples
chess processors
optimization processors


MIMD Multiple Instruction Multiple Data


Parallel system architecture
Execution units work independently and asynchronously
Processors execute different instructions on different
data
Example: modern cluster systems


Parallel memory organization


DMM Distributed Memory Machines

SMM Shared Memory Machines


DMM - Distributed Memory Machines


Consists of several nodes
connected via a network
Each node consists of
several processors/cores
local memory
a network controller (I/O)


Communication on DMMs
Local memory is private
Message exchange/communication via the network

Communication between cooperating sequential processes


Processes PA, PB run on different nodes A & B
PA sends a message to PB:
PA executes a send instruction with the message & the target PB
PB executes a receive instruction with a declaration of the receive
buffer and the source process PA

Need for a message-passing programming model


Standard: Message Passing Interface (MPI)
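
A minimal sketch of this send/receive pattern in MPI (illustrative only; rank 0 plays PA, rank 1 plays PB):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double msg = 3.14;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                    /* process PA */
            MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {             /* process PB */
            MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("PB received %f\n", msg);
        }

        MPI_Finalize();
        return 0;
    }

Run e.g. with mpirun -np 2; PA (rank 0) sends one double, PB (rank 1) receives it into its local buffer.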

SMM Shared Memory Machines


Several cores per processor
shared or global memory
unified address space
On-board connection between processors and memory
(not necessarily a network connection)


SMM Shared Memory Machines


Access to common/global variables:
concurrent, uncoordinated accesses/writes
from different processes must be avoided
Advantages of SMM
easier to program
less communication
better memory usage
Disadvantages of SMM
needs a high-bandwidth interconnect
(in order to realize fast memory access)
difficult to realize for a large number of processors

Shared Memory Programming


Shared memory programming can be done on a system
which has more than one computing unit (core) sharing
the same physical memory.
Data is shared between the computing units in the
form of shared variables.
There are many tools (Application Programming Interfaces,
APIs) like OpenMP, pthreads and Intel Threading Building Blocks
(TBB) available for shared memory programming.
Note that the shared address space model is different from
the shared memory model.
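
A minimal OpenMP sketch of shared memory programming (illustrative; compile e.g. with gcc -fopenmp). The array a is a shared variable; each thread accumulates a private partial sum that OpenMP combines at the end (reduction):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        for (long i = 0; i < N; i++)            /* initialise the shared data */
            a[i] = 1.0;

        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)            /* iterations split among threads */
            sum += a[i];

        printf("sum = %f (max. threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }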

