
Parallel Computing I

Parallel Hardware Architecture


Thorsten Grahs, 11. May 2015

Table of contents
Foundation
Hardware architecture
Pipelining
Serial performance
Classification of parallel hardware


Moore's law
In 1965 Gordon Moore observed a trend in the number
of transistors per chip:
he found a doubling every second year.
This observation has become known as
Moore's law.

Trend in # of transistors
Year 1971:
1000 (10^3) transistors per core
Year 2011:
1000 million (10^9) transistors per core
⇒
six orders of magnitude over 40 years, i.e.
a factor of ≈ 1.995 every 2 years
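
A quick check of the factor quoted above, using only the two data points from this slide:

    \[ \frac{10^9}{10^3} = 10^6 \ \text{over 40 years}
       \;\Rightarrow\;
       \text{factor per 2 years} = \left(10^6\right)^{2/40} = 10^{0.3} \approx 1.995 \]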

Transistors per Processor over Time

Continues to grow exponentially (Moore's law)



Performance growth
So why worry about parallel computing?
If you need a performance improvement:
save yourself the time of parallelising your problem,
wait two years and spend the time fishing (or something else).
Thanks to the increasing clock rate & Moore's law, you can simply
run on a faster computer.
Increasing clock rate?
Well...
Clock rate growth halted around 2005


Processor Clock Rate over Time

Stopped growing exponentially (Power wall)



Performance Barriers
The free lunch is over!
Several barriers prevent further performance growth via clock rate & transistor count:
Memory wall
Growing disparity between the speed of the CPU and of the memory
outside the CPU chip. The reason is the limited
communication bandwidth across chip boundaries.
Power wall
The clock frequency cannot be increased without
exceeding the limits of air cooling.
Instruction-level parallelism (ILP) wall
Limits to the available low-level parallelism.
ILP is a measure of how many of the operations in a computer
program can be performed simultaneously.

Power wall


Memory wall
Problem: bringing the data to the CPU fast enough
Limiting factors
Bandwidth defines the maximum
throughput of a communication
path in a digital system
(measured in bit/s).
Latency describes the delay
between the request and the
receipt of data from RAM.
Storage density is the amount of memory
that fits into a given physical space.

Storage density
Why problematic?
Data can't get to the CPU fast enough:
light travels 0.3 mm in 10^-12 s,
so the memory would have to be located within a radius of 0.3 mm around
the CPU to fetch all data in time.
Example: weather forecast (first lecture)
The simulation stores 20 numbers per grid point
(temperature, velocity, pressure, k, ...)
32 bit per number, 10^10 grid points ⇒ 6.4 * 10^12 bit
⇒ storage density of ≈ 1 bit per atom for a suitable memory
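
The bit count follows directly from the stated assumptions:

    \[ 20\ \tfrac{\text{numbers}}{\text{point}} \times 32\ \tfrac{\text{bit}}{\text{number}} \times 10^{10}\ \text{points}
       = 6.4 \times 10^{12}\ \text{bit} \approx 0.8\ \text{TByte} \]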

Need for explicit parallel mechanisms


Conclusion
Requirements for performance scaling
explicit parallel mechanisms
explicit parallel programming
Outcome
Many-core CPUs
Parallel programming
Distributed memory computing
Cluster computer/GPGPUs

The answer to which question?


Why not use one big master mind?
"If you were ploughing a field, what would you rather use?
Two strong oxen or 1024 chickens?"
Seymour Cray (1925-1996)
Founder of Cray Inc.
Answer to Seymour Cray
(and the foundation of distributed memory computing)

Because this is inflexible and sluggish!


"To pull a bigger wagon, it is easier to add more oxen
than to grow a gigantic ox."
W. Gropp, E. Lusk, A. Skjellum in "Using MPI"

Example: Distributed memory machine


Building a cluster
(i.e. a distributed memory machine)
Idea
Use e.g. 1000 standard PCs with 10^9 FLOPS each
Domain decomposition:
each processor is responsible for 10^7 grid points
The memory-CPU distance may be up to 300 mm
Memory requirement per CPU: 800 MByte
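
These figures are consistent with the weather-forecast example above (10^10 grid points, 20 numbers of 32 bit each):

    \[ \frac{10^{10}\ \text{points}}{1000\ \text{PCs}} = 10^{7}\ \tfrac{\text{points}}{\text{PC}}, \qquad
       10^{7} \times 20 \times 4\ \text{Byte} = 800\ \text{MByte per CPU} \]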

Circumvention for
Storage density problem
Memory wall
Power wall
ILP wall

Sequential computing
Classical hardware with
undivided memory that
stores program & data
Central processing unit
(CPU) that executes the
instructions,
operating on the data

Performance: that's what it's all about!


GFLOPS = GHz? This holds for vector or superscalar processors,
but only in ideal situations (peak performance).
For practical applications, peak performance is hard to achieve.

von Neumann architecture


John von Neumann (1903-1957)
Hungarian-American mathematician
From 1933 at the Princeton
Institute for Advanced Study
Work related to mathematics, quantum
physics and computer science
Electronic Numerical Integrator And Computer (ENIAC)
program (Univ. of Pennsylvania) with J. Eckert & J. Mauchly
Electronic Discrete Variable Automatic Computer (EDVAC)
program for the US Army: ballistic calculations
"First Draft of a Report on the EDVAC", 1945

von Neumann architecture II
Reference model for a sequential computer
Instructions & Data are located in one memory
First universal programmable computer
(before: punch cards/hardware)


Harvard architecture
(Parallel) Harvard architecture
Reference model for fast CPUs (Pipelining)
Instructions & data located in different memories
Advantages
Instruction/data can be handled simultaneously
separated data & instruction bus
RISC (Reduced Instruction Set Computer)
avoids complex instruction sets that combine memory
access (slow) with arithmetic operations (fast)

Harvard architecture II
Handles instruction & datum in 1 cycle (in parallel),
whereas the von Neumann architecture reads them serially (2 cycles)


Example | Vector add


Task list for sequential computing

Vector addition
    #define N 1000000
    unsigned long i;
    double a[N];
    double sum = 0.0;
    ...
    for (i = 0; i < N; ++i)
        sum = sum + a[i];

Fetch the next instruction from
memory into a register
Decode/interpret the instruction
Load the first argument (sum) from
memory into a register
Load the 2nd argument (a[i]) from
memory into an additional register
Execute the instruction and store the
result in a third register
Write the result (sum) back to memory

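For reference, a minimal compilable version of the loop above (the initialisation and the printf are additions for illustration, not part of the slide):

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];          /* static: keep the large array off the stack */
        double sum = 0.0;
        unsigned long i;

        for (i = 0; i < N; ++i)      /* fill the vector with sample data */
            a[i] = 1.0;

        for (i = 0; i < N; ++i)      /* the loop from the slide */
            sum = sum + a[i];

        printf("sum = %f\n", sum);
        return 0;
    }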

Von Neumann bottleneck


Performance
The main barrier is the memory:
DRAM runs at e.g. 1066 MHz,
modern CPUs are ~100x faster, i.e. the memory
thwarts the computational power of a serial computer
Remedy
Fast memory buffers to avoid (re-)accessing
the slower main memory
Cache hierarchy: L1, L2, L3 cache
RISC & pipelining

Performance comparison
Figure: Linpack benchmark (BLAS, Basic Linear Algebra Subroutines):
performance versus matrix size (number of variables)

Complex Instruction Set Computers CISC


CISC (used before 1985)
Simpler programming with complex operations in one
instruction (software was expensive)
Bounded memory size favoured short programs
(complex instructions ⇒ compact programs)
Disadvantages
Complex instructions imply a slower clock rate
Pipelining is harder to implement
Longer development cycles
Error prone
Typically only ~30% of the instructions were used

Reduced Instruction Set Computers RISC


RISC (used since 1985)
Memory became cheaper and faster
Decoding and executing instructions became the limiting
factor
Advantages
Simpler software design
Easier to debug
Shorter development cycles
Cheaper to produce
Faster execution
More potential for optimization
(complex instructions have many interdependencies that hinder it)

Why is RISC faster?


CISC: short programs with complex instructions
RISC: long programs with simple instructions
Methods used in RISC computers
Instruction pipelines
parallel execution of instruction phases
(fetch, decode, execute, write)
Cached memory
small but fast memory buffers between
main memory and CPU
Superscalar execution
grouping of instructions for parallel execution

Pipelining
Decomposition of tasks into several steps
Model: Assembly line car production


Pipelining | Example: laundry I


Anna, Britta, Caspar & Daniel:
each of them has a load of dirty
laundry to wash, dry and fold
Washing takes 30 minutes
Drying takes 40 minutes
Folding takes 20 minutes


Pipelining | Example: laundry II

Sequential execution

(6 h for 4 loads of laundry)


Pipelining | Example: laundry III

Pipelining

(3.5 h for 4 loads of laundry)



Pipelining | Example: laundry IV


Pipelining increases
throughput (operations/time)
The throughput depends on the
longest operation step
Unbalanced step lengths
reduce the speed-up
Potential speed-up
= # of pipeline stages

Pipelining for laundry

(3.5 h for 4 loads of laundry)
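
A quick check of the two times from the laundry slides (30 min washing, 40 min drying, 20 min folding, 4 loads):

    \[ T_{\text{seq}}  = 4 \times (30+40+20)\ \text{min} = 360\ \text{min} = 6\ \text{h} \]
    \[ T_{\text{pipe}} = (30+40+20)\ \text{min} + 3 \times 40\ \text{min} = 210\ \text{min} = 3.5\ \text{h} \]

The 40-minute dryer is the longest stage, so it sets the rate at which finished loads leave the pipeline.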


Pipelining | Logical steps


Logical steps in a computer
IF  Instruction Fetch
ID  Instruction Decode
OF  Operand Fetch
EX  Execution
OS  Operand Store


Pipelining | Logical steps II


Logical instruction phases
IF  Instruction Fetch
ID  Instruction Decode
OF  Operand Fetch
EX  Execution
OS  Operand Store


Problems with pipelining | Data dependencies


Data dependencies
One instruction may depend on results of the previous
instruction:

    a = x + y;
    b = a * c;

The computation of b must wait for the end of the computation of a.
Resolving the data dependency:
the computation of a can be executed ahead of time

Out-of-order execution


The compiler (or the CPU) changes the execution order to resolve data
dependencies
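
A minimal sketch of such a reordering at the source level; the variables u, v and d are hypothetical, added only to have an independent instruction available. In practice the compiler or the out-of-order hardware performs this automatically:

    /* original order: the multiplication must wait for a */
    a = x + y;
    b = a * c;
    d = u + v;          /* independent of a and b */

    /* reordered: the independent addition fills the waiting cycles */
    a = x + y;
    d = u + v;          /* executes while a is still in flight */
    b = a * c;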

Problems with pipelining | Branches


Conditional control flow
The control flow/branch depends on former instructions:

    if (d > 2)
        { x = 0.; }
    else
        { x = 1.; }

Branch prediction
The hardware predicts the most probable branch and executes its
instructions speculatively; a misprediction forces a pipeline flush

Problems with pipelining | Inlining


Unconditional control flow
The control flow has to be interrupted for the function call:

    double mult(double a, double b) { return a * b; }
    ...
    for (i = 0; i < n; i++) x[i] = mult(x[i], 1. + 1./2.);

Inlining
The pre-processor substitutes the function call:

    #define mult(a,b) ((a)*(b))
    ...
    for (i = 0; i < n; i++) x[i] = mult(x[i], 1. + 1./2.);
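
As a side note (not on the slide): in modern C the same effect is usually obtained with an inline function instead of a macro, which keeps type checking:

    /* the compiler substitutes the call body, like the macro above */
    static inline double mult(double a, double b) { return a * b; }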

Problems with pipelining | Loop unrolling


Unconditional control flow
The control flow has to be interrupted at the end of every iteration:

    for (i = 0; i < n; i++)
        x[i] = a * x[i];

Loop unrolling
Unrolling the loop / increasing the loop-counter stride
(assuming n is a multiple of 4):

    for (i = 0; i < n; i += 4) {
        x[i]     = a * x[i];
        x[i + 1] = a * x[i + 1];
        x[i + 2] = a * x[i + 2];
        x[i + 3] = a * x[i + 3];
    }

Serial performance
CPU performance = operations per time unit,
measured in FLoating point OPerations per Second (FLOPS)
Computed from
the number of pipelines
the operations per pipeline
the mean time unit per pipeline phase

Interested in
Peak performance (Maximal performance)
Mean performance (LinPack test)
Comparison to memory access time

Peak Performance
Processor

Clock rate 2.4 GHz

Number of pipelines
= # ALUs (Arithmetic Logical Units)
Operations per pipeline
= #OPsPerALU (generally 2 operations at the same time)
Mean time unit per pipeline step
= 1 / clock rate; for a 2.4 GHz CPU: 1/(2.4 * 10^9) s ≈ 417 ps

Peak Performance

    R_peak = (#ALUs * #OPsPerALU) / (mean time unit)
           = (2 * 2) / 417 ps
           ≈ 9.6 GFLOPS

Storage clock rate vs. CPU clock rate


Time for fetching data from memory
DDR3-1600 DRAM (Double Data Rate Dynamic RAM)
Memory clock 400 MHz, bus clock 1600 MHz

Deliver the address via the bus to the memory
(≈ 1 clock of the bus)
Memory latency time
(≈ 4-8 clocks of the memory clock; 8 in the calculation below)
(time between address access and data fetch)
Refreshing the memory after reading/writing data
(≈ 20-40 clocks of the memory clock; 30 below)
Transporting the data from memory
(≈ 1 clock of the bus)

    (1 + 1)/(1600 * 10^6 /s) + 8/(400 * 10^6 /s) + 30/(400 * 10^6 /s) ≈ 96 ns
         bus time                memory latency time    refreshing


Memory clock rate vs. CPU clock rate


Computer
2.4 GHz CPU, DDR3-1600 RAM, 1600 MHz FSB / 400 MHz DRAM
Peak performance

R_peak = 9.6 GFLOPS

Clock cycle ≈ 417 ps


Operation (reading a datum)
≈ 96 ns

    96 ns / 417 ps ≈ 230

That means
reading/fetching a datum takes ≈ 230 CPU clock cycles,
i.e. ≈ 920 operations could be executed during this time

Classification of parallel architectures


Flynn's taxonomy
Michael J. Flynn (*1934), Professor Emeritus, Stanford Univ.
Classification of computer architectures by instruction and
data streams

                  Single Instruction    Multiple Instruction
  Single Data     SISD                   MISD
  Multiple Data   SIMD                   MIMD

SISD Single Instruction Single Data


Hardware
Sequential computer
Classical hardware with one ALU
Princeton architecture (v.Neumann)
Harvard architecture


SIMD Single Instruction Multiple Data


Hardware
Parallel system architecture
Modern CPUs with several ALUs
Data are stored in vector registers
Each instruction operates on an entire vector register
Further development of Harvard architecture
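
As an illustration (not from the slide): a loop that a compiler can map onto such vector registers automatically, e.g. with gcc -O3 (auto-vectorization); one SIMD instruction then processes several array elements at once:

    /* element-wise vector addition: a typical candidate for SIMD execution */
    void vec_add(double *c, const double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];   /* single instruction, multiple data */
    }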


SIMD Single Instruction Multiple Data


Hardware
Array or vector processor (SIMD with parallel registers)
Clock-synchronized execution of one
instruction on different data (e.g. addition of two vectors)
Example: Cray-1 (1976)
80 MFLOPS, vector registers with 64 elements
Weight: 5.5 tons
Los Alamos National Lab.
(atomic test computations)
European Centre for Medium-Range Weather Forecasts
(10-day weather forecasts)

SIMD Single Instruction Multiple Data


Example: Thinking Machines
Connection Machine
CM-1 (1985)
201 MFLOPS
65,536 1-bit CPUs,
each with its own 4-Kbit
memory
Clock rate 4 MHz
SIMD machine
CM-5 (1991): MIMD
architecture

MISD Multiple Instruction Single Data


Parallel or serial system architecture
Used for redundant calculations or to obtain different
solutions for one problem
Sometimes this class is considered empty
Examples
chess processors
optimization processors


MIMD Multiple Instruction Multiple Data


Parallel system architecture
Execution units work independently and asynchronously
Processors execute different instructions on different
data
Example: modern cluster systems


Parallel memory organization


DMM Distributed Memory Machines

SMM Shared Memory Machines


DMM - Distributed Memory Machines


Consists of several nodes
connected via a network
Each node consists of
several processors/cores
local memory
a network controller (I/O)


Communication on DMMs
Local memory is private
Message exchange/communication via the network

Communication between cooperating sequential processes


Processes PA, PB run on different nodes A & B
PA sends a message to PB:
PA executes a send instruction with the message & the target PB
PB executes a receive instruction with a declaration of the receive
buffer and the source process PA

Need for a message-passing programming model


Standard: Message Passing Interface (MPI)
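
A minimal sketch of this send/receive pattern in MPI (illustrative only; rank 0 plays PA, rank 1 plays PB):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double msg = 3.14;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                    /* process PA */
            MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {             /* process PB */
            MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("PB received %f\n", msg);
        }

        MPI_Finalize();
        return 0;
    }

Run e.g. with mpirun -np 2; PA (rank 0) sends one double, PB (rank 1) receives it into its local buffer.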

SMM Shared Memory Machines


Several cores per processor
shared or global memory
unified address space
On-board connection between processors and memory
(not necessarily a network connection)


SMM Shared Memory Machines


Access to common/global variables:
concurrent, uncoordinated accesses/writes
from different processes must be avoided
Advantages of SMM
easier to program
less communication
better memory usage
Disadvantages of SMM
needs a high-bandwidth interconnect
(in order to realize fast memory access)
difficult to realize for a large number of processors

Shared Memory Programming


Shared memory programming can be done on a system
which has more than one computing unit (core) sharing
the same physical memory.
Data is shared between the computing units in the
form of shared variables.
There are many tools (Application Programming Interfaces,
APIs) like OpenMP, pthreads and Intel Threading Building Blocks
(TBB) available for shared memory programming.
Note that the shared address space model is different from
the shared memory model.
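
A minimal OpenMP sketch of shared memory programming (illustrative; compile e.g. with gcc -fopenmp). The array a is a shared variable; each thread accumulates a private partial sum that OpenMP combines at the end (reduction):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        for (long i = 0; i < N; i++)            /* initialise the shared data */
            a[i] = 1.0;

        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)            /* iterations split among threads */
            sum += a[i];

        printf("sum = %f (max. threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }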

