
Parallel Programming

1) History of parallel computing

• Introduction and definitions
• Brief history of supercomputers
• Programming parallel computers


Parallel programming 2009, Ver. 2.0

Copyright 2007–2009 Andrea Di Blas

Introduction and definitions

Why faster computers?

• Solve compute-intensive problems faster
  - Make infeasible problems feasible
  - Reduce design time
• Solve larger problems in the same amount of time
  - Improve an answer's precision
  - Reduce design time
• Gain competitive advantage

End of the Moore's Law "free lunch for software"


Definitions

• Parallel computing: using parallel computers to solve single problems faster.
• Parallel computer: a multiple-processor system supporting parallel execution.
• Parallel programming: programming parallel computers. Two ways:
  - Explicit
  - Implicit


History of parallel computing

Military-driven evolution of (super)computing

• World War II
  - Hand-computed artillery tables: ENIAC (USA)
  - Breaking Nazi codes: Bombe, Colossus (UK)
• Cold War
  - Nuclear weapons design
  - Aircraft, submarine, etc. design
  - Intelligence gathering
  - Code breaking


ENIAC, 1943

Eckert and Mauchly build the ENIAC (Electronic Numerical Integrator And Calculator), the first general-purpose electronic computer.


The first attempt at a supercomputer: Illiac-IV, 1966-1976

• Linear array of 256 64-bit Processing Elements, ECL
• Target: 1 GFLOPS, 13 MHz clock
• Programmed in "GLYPNIR", a vectorized derivative of ALGOL 60

The first real supercomputer: Seymour Cray's CRAY-1, 1976

• Scalar + vector processor, 80 MHz clock, 133 MFLOPS, 8 MB main memory in bipolar technology (ECL), $5 to $8+ million
• 150 kW motor-generator, 20-ton compressor for the freon cooling system
• Programmed in CFT, the Cray Fortran Compiler, which vectorized DO loops

Commercial supercomputing

• Started in capital-intensive industries:
  - Petroleum exploration
  - Automobile and aircraft manufacturing
• Today:
  - Consumer products
  - Pharmaceutical design
  - Circuit simulation
  - ...


Microprocessor-based supercomputers: Caltech's Cosmic Cube (1981)

• 64-node hypercube based on the Intel 8086 + 8087, 128 KB RAM per node
• 8 MHz, 10 MFLOPS, $80,000
• Programmed in Pascal or C, with a message-passing library

A new model: Thinking Machines' CM-1

• Tried to model the human brain: variable-connectivity 12-D hypercube
• 65,536 1-bit processing elements, with 4 Kbit (CM-1) or 64 Kbit (CM-2) of memory per processor; 2,500 MIPS and 2,500 MFLOPS (CM-2)
• Programmed in *Lisp, C*, or CM Fortran


A popular massively parallel SIMD computer: MasPar MP-2 (1993)

• 2-D mesh of up to 16K processors, 1-bit (MP-1) or 32-bit (MP-2)
• Full-fledged SIMD, with Xnet and global-router communication
• Programmed in MPL (MasPar Language) and HPF (High-Performance Fortran)


Commodity clusters: NASA's Beowulf cluster (1994)

• 16 Intel 486DX PCs connected with standard 10 Mb/s Ethernet
• Linux with MPI
• 1 GFLOPS on a $50,000 system

Massively parallel SIMD coprocessors: UCSC Kestrel (1999)

• 512-PE linear SIMD array of 8-bit Processing Elements (PEs), 20 MHz
• 64 PEs per chip, 0.5 μm CMOS (HP), 256 bytes of SRAM per PE
• 30 GOPS (8-bit integer), 1 W peak power per chip

Today: IBM BlueGene/L

• 64K nodes (32 × 32 × 64) in a 3-D torus, two PowerPC 440 cores at 700 MHz per node
• 360 TFLOPS peak (world's fastest supercomputer)
• Starting at only $1.5 million per rack (1,024 nodes)

Tomorrow: IBM Roadrunner

• 1.6 PFLOPS peak (1.0 PFLOPS Linpack)
• Hybrid AMD Opteron + IBM Cell, 16K nodes
• Being delivered to Los Alamos National Laboratory, fully operational in 2008


Supercomputer manufacturers

Today they are:

• IBM
• NEC
• Cray Inc.
• Dell
• Hewlett-Packard
• Sun Microsystems
• Silicon Graphics


Yesterday: IBM 7044

• Solid-state (transistors), 36-bit words, 32K address space
• Fixed-point and floating-point arithmetic

Programming parallel computers

Seeking concurrency:

• Data parallelism
• Functional parallelism


Data dependence graphs

P = (X + Y) * (X - Y)
Q = Z - W
T = P + Q


Data parallelism

• Independent tasks apply the same operation to different data.
• Example:

    for (i = 0; i < 100; ++i)
        a[i] = b[i] + c[i];

• OK to perform the operations concurrently (see the sketch below)

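A minimal sketch (an addition, not from the original slides) of how the loop above could be divided among workers: each worker applies the same operation to its own slice of the data. The helper name and the slicing are illustrative assumptions.

    #include <stdio.h>

    /* Each worker would call add_chunk on its own index range [lo, hi);
     * all calls apply the same operation to different data, so they
     * could run concurrently on different processors. */
    static void add_chunk(double *a, const double *b, const double *c,
                          int lo, int hi)
    {
        for (int i = lo; i < hi; ++i)
            a[i] = b[i] + c[i];
    }

    int main(void)
    {
        double a[100], b[100], c[100];
        for (int i = 0; i < 100; ++i) { b[i] = i; c[i] = 100 - i; }
        /* Sequential stand-in for two workers handling half the data each. */
        add_chunk(a, b, c, 0, 50);
        add_chunk(a, b, c, 50, 100);
        printf("a[0] = %g, a[99] = %g\n", a[0], a[99]);
        return 0;
    }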

Functional parallelism

• Independent tasks apply different operations to different data.
• Example:

    a = 2;
    b = 3;
    m = (a + b) / 2;
    s = (a*a + b*b) / 2;
    v = s - m;

• The first and second statements can execute concurrently
• The third and fourth statements can execute concurrently (see the sketch below)

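One way to express this functional parallelism explicitly is with OpenMP sections (OpenMP is introduced later in these slides). This is an added sketch, not part of the original notes; without an OpenMP-capable compiler the pragmas are ignored and the code simply runs sequentially.

    #include <stdio.h>

    int main(void)
    {
        double a, b, m, s, v;
        a = 2;                    /* step 1: a and b do not depend on each other */
        b = 3;
        #pragma omp parallel sections
        {
            #pragma omp section
            m = (a + b) / 2;      /* step 2: needs a and b, but not s */
            #pragma omp section
            s = (a*a + b*b) / 2;  /* step 2: needs a and b, but not m */
        }
        v = s - m;                /* step 3: needs both m and s */
        printf("m = %g  s = %g  v = %g\n", m, s, v);
        return 0;
    }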

Four possible ways to program parallel computers:

• Extend compilers to translate sequential programs into parallel code automatically ("implicitly parallel")
• Extend languages with new operations to express parallelism ("explicitly parallel")
• Add a new parallel language layer on top of an existing sequential language
• Define a totally new parallel language and compiler system


Strategy 1: Extend compilers

Let the compiler discover parallelism and produce executable code.

Advantages:

• Easiest to use: doesn't require any specific parallel programming training
• Leverage billions of lines of existing (Fortran) code

Disadvantages:

• Parallelism may be lost when programs are formulated in a sequential fashion
• Performance of parallelizing compilers still poor on generic applications
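The first disadvantage can be made concrete with a small example (an addition, not from the original slides): the first loop below has independent iterations that a parallelizing compiler can distribute, while the second has a loop-carried dependence that defeats straightforward automatic parallelization.

    /* Illustration only: two loops as a parallelizing compiler sees them. */
    void example(double *x, double *y, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] = 2.0 * x[i];         /* iterations are independent        */

        for (int i = 1; i < n; ++i)
            x[i] = x[i - 1] + x[i];    /* needs the previous iteration's x  */
    }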


Strategy 1: Example

www.parallelsp.com

• Translates serial FORTRAN source code into parallel source code


Strategy 2: Extend a language

Let the programmer create, terminate, and synchronize processes, and define all communications, to explicitly encode parallelism.

Advantages:

• Easiest, quickest, and least expensive to implement
• Leverages existing compiler technology
• New libraries ready soon after new parallel computers are available

Disadvantages:

• Lack of compiler support to catch errors
• Easy to write programs that are hard to debug
• Harder to learn

Strategy 2: Examples

All the most popular parallel programming tools belong to this class:

• Message-Passing Interface (MPI)
• Open specifications for Multi-Processing (OpenMP)
• POSIX Threads (Pthreads)
• Parallel Virtual Machine (PVM)


Strategy 3: Two-layer approach

View each parallel program as made of two layers:

• Lower layer:
  - Single-process computation (core of the computation)
  - Expressed in any sequential programming language
• Upper layer:
  - Creation and synchronization of processes
  - Partitioning of data among processes
• Only research prototypes so far.


Strategy 3: Example

The CODE Project at the University of Texas
www.cs.utexas.edu/users/code/

• Visually glue C functions in parallel using Pthreads, MPI, or PVM



Strategy 4: Create a parallel language

Two approaches:

• Create a parallel language from scratch
• Add parallel constructs to an existing language: Fortran 90, High-Performance Fortran (HPF), C* (Thinking Machines Corp.)

Advantages:

• Program with parallelism in mind (higher performance)

Disadvantages:

• Requires new languages and new compilers
• Programmers' resistance


Strategy 4: Examples

• INMOS' Occam language
• High-Performance FORTRAN
• SISAL dataflow language


Current status

The low-level approach is most popular:

• Augment existing languages with low-level parallel constructs
• MPI, OpenMP, and Pthreads

Advantages:

• Efficiency
• Portability

Disadvantages:

• Harder to program
• Harder to debug


MPI

• MPI = "Message-Passing Interface"
• Explicitly parallel programming strategy
• Standard specification for a message-passing API
• (Free) libraries available on virtually all parallel computers, including networks of workstations and commodity clusters
• Libraries available for C/C++ and Fortran
• Assumes distributed-memory systems:

[Figure: each node has its own CPU, cache, memory, and I/O devices; nodes communicate over an interconnection network.]
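A minimal MPI sketch in C (an addition, not from the original slides): every process learns its rank, and rank 1 sends one integer to rank 0 over the network. The values are arbitrary; compile and launch commands (e.g. mpicc, mpirun) vary by installation.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);                   /* start the MPI runtime  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* who am I?              */
        MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes?    */

        if (rank == 1) {
            int value = 42;
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0 && size > 1) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank 1\n", value);
        }

        printf("hello from rank %d of %d\n", rank, size);
        MPI_Finalize();                           /* shut down the runtime  */
        return 0;
    }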


OpenMP

• OpenMP = "Open specifications for Multi-Processing"
• Incrementally explicit parallel programming strategy
• An application program interface (API) for multi-threaded, shared-memory systems
• A set of compiler directives and runtime library routines
• Available for C/C++ and Fortran
• Not meant for distributed-memory systems, only for shared-memory systems:

[Figure: several CPUs, each with its own cache, share the main memory and I/O devices over a bus.]
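A minimal OpenMP sketch in C (an addition, not from the original slides): one compiler directive spreads the loop iterations across the threads of a shared-memory machine. Array sizes and values are arbitrary; compile with an OpenMP-capable compiler (e.g. gcc -fopenmp).

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; ++i) { b[i] = i; c[i] = 2.0 * i; }

        /* The directive asks the compiler to divide the iterations among
         * the available threads; a, b, and c are shared by all threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] = b[i] + c[i];

        printf("a[%d] = %g, using up to %d threads\n",
               N - 1, a[N - 1], omp_get_max_threads());
        return 0;
    }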


Pthreads

• Pthreads = "POSIX Threads"
• Explicitly parallel programming strategy
• An application program interface (API) for multi-threaded, shared-memory systems
• A set of library routines, for C/C++ only
• Not meant for distributed-memory systems, only for shared-memory systems:

[Figure: several CPUs, each with its own cache, share the main memory and I/O devices over a bus.]
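A minimal Pthreads sketch in C (an addition, not from the original slides): the main thread creates a few worker threads, each fills its own slice of a shared array, and the main thread joins them. The thread count and array size are arbitrary; link with -lpthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000

    static double a[N];                      /* shared by all threads */

    static void *fill_slice(void *arg)
    {
        long id = (long)arg;                 /* worker index 0..NTHREADS-1 */
        long lo = id * (N / NTHREADS);
        long hi = lo + (N / NTHREADS);
        for (long i = lo; i < hi; ++i)
            a[i] = (double)i * i;            /* each thread writes its own slice */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; ++t)
            pthread_create(&tid[t], NULL, fill_slice, (void *)t);
        for (long t = 0; t < NTHREADS; ++t)
            pthread_join(tid[t], NULL);      /* wait for all workers */
        printf("a[%d] = %g\n", N - 1, a[N - 1]);
        return 0;
    }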


Practice

Practice problem 1.A

Card sorting:
1. How long does it take one person to sort a shuffled deck of cards?
2. How long does it take p people to sort p decks of cards?
3. How long does it take p people to sort one deck of cards?
4. What is the optimal number of people?


Practice problem 1.B

You have 1000 cards, each with a number on it, and you can use up to 1000 accountants (ready at their desks in a 40 × 25 cave) to add them all up.
1. How do you do it?
2. How long does it take?
3. Where are you?
4. Can they do it 1000 times faster than you?
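One possible line of attack (an added sketch, not necessarily the intended answer): treat the addition as a tree-shaped reduction, pairing accountants off in rounds, so about ceil(log2 1000) = 10 addition rounds suffice instead of 999 sequential additions; walking cards between desks (communication) is what keeps the real speedup well below 1000x. A sequential model of the rounds:

    #include <stdio.h>

    int main(void)
    {
        double card[1000];
        for (int i = 0; i < 1000; ++i)
            card[i] = i + 1;                 /* example numbers on the cards */

        int n = 1000;
        while (n > 1) {                      /* one round of pairwise sums   */
            for (int i = 0; i < n / 2; ++i)  /* up to n/2 accountants add    */
                card[i] = card[2 * i] + card[2 * i + 1];
            if (n % 2)                       /* odd card out carries forward */
                card[n / 2] = card[n - 1];
            n = (n + 1) / 2;
        }
        printf("total = %g\n", card[0]);     /* 1 + 2 + ... + 1000 = 500500  */
        return 0;
    }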

