
Supercomputer Architecture: The TeraFLOPS Race

Stephen Jenks
Scalable Parallel & Dist. Systems Lab
EECS Colloquium Feb. 16, 2005

Why Supercomputing?
† Some Problems Larger Than a Single Computer Can Process
„ Memory Space (>> 4-8 GB)
„ Computation Cost (O(n^3), for example)
„ More Iterations (100 years)
„ Data Sources (Sensor processing)
† National Pride
† Technology Migrates to Consumers

Supercomputer Applications
† Weather Prediction
† Pollution Flow
† Fluid Dynamics
† Stress Analysis
† Protein Folding
† Chemistry Simulation
† Nuclear Simulation
† Equation Solving
† Code Breaking

How Fast Are Supercomputers?
† The Top Machines Can Perform Tens of Trillions of Floating-Point Operations per Second (TeraFLOPS)
† They Can Store Trillions of Data Items in RAM!
† Example: 1 km grid over the USA
„ 4000 x 2000 x 100 = 800 million grid points
„ If each point has 10 values, and each value takes 10 ops to compute => 80 billion ops per iteration
„ If we want 1-hour timesteps for 10 years, that's 87,600 iterations
„ More than 7 peta-ops total! (see the quick check below)
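
A quick back-of-the-envelope check of these numbers, as a minimal C sketch (illustrative only, not part of the original talk):

/* Sanity-check the slide's arithmetic for the 1 km grid example. */
#include <stdio.h>

int main(void)
{
    double points       = 4000.0 * 2000.0 * 100.0;  /* 800 million grid points */
    double ops_per_iter = points * 10.0 * 10.0;     /* 10 values x 10 ops each */
    double iterations   = 10.0 * 365.0 * 24.0;      /* hourly steps for 10 years */
    double total_ops    = ops_per_iter * iterations;

    printf("ops/iteration = %.3g, iterations = %.0f, total = %.3g ops\n",
           ops_per_iter, iterations, total_ops);    /* total is about 7e15 */
    return 0;
}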

How Fast is That?
† Cray-1 (1977)
„ 250 MFLOPS
„ 80 MHz
„ 1 MWord (64-bit)
† Original PC CPU: Intel 8088 (1979)
„ 5 MHz
„ 1 MB address space
† Modern PC (Pentium 4)
„ 3 GHz
„ 6 GFLOPS
„ 4 GB RAM

http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html
Lies, Damn Lies, and Statistics
† Manufacturers Claim Ideal Performance
„ 2 FP Units @ 3 GHz => 6 GFLOPS (see the sketch below)
„ Dependences mean we won't get that much!
† How Do We Know Real Performance?
„ Top500.org Uses High-Performance LINPACK (HPL)
„ http://www.netlib.org/benchmark/hpl
„ Solves a Dense Set of Linear Equations
„ Lots of Communication and Parallelism
„ Not Necessarily Reflective of Target Apps
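
As a rough illustration of the peak-versus-real gap (not from the talk; the "measured" value below is a hypothetical placeholder), peak performance is just functional units times clock rate, and Top500 efficiency is Rmax divided by Rpeak:

/* Peak FLOPS and LINPACK efficiency, using the slide's Pentium 4 figures. */
#include <stdio.h>

int main(void)
{
    double clock_ghz    = 3.0;              /* clock rate in GHz */
    double fp_per_cycle = 2.0;              /* FP results per cycle (claimed) */
    double rpeak_gflops = clock_ghz * fp_per_cycle;   /* 6 GFLOPS peak */

    double rmax_gflops  = 3.5;              /* hypothetical measured HPL result */
    printf("Rpeak = %.1f GFLOPS, efficiency = %.0f%%\n",
           rpeak_gflops, 100.0 * rmax_gflops / rpeak_gflops);
    return 0;
}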

Who Makes Supercomputers?

Supercomputer Architectures
† All Have Some Parallelism; Most Have Several Types
„ Pipelining (Overlapping Execution of Several Instructions)
„ Shared Address Space Parallelism
„ Distributed Memory (Multicomputer)
„ Vector or SIMD
† Almost All Use the Single-Program, Multiple-Data (SPMD) Model (see the sketch below)
„ Same Program Runs on All CPUs
„ Unique Identifier Per Copy
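
A minimal SPMD sketch, assuming MPI as the programming interface (the message passing examples later in the talk use MPI): every CPU runs the same program and uses its rank as the unique identifier.

/* SPMD skeleton: one program, many copies, each identified by its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* unique ID of this copy */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of copies */

    /* Each copy picks its share of the work based on its rank. */
    printf("copy %d of %d starts at row %d\n", rank, nprocs, rank * 100);

    MPI_Finalize();
    return 0;
}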

Architecture Diagrams
[Diagrams: a shared address space machine, with CPUs connected to a single shared memory, and a distributed memory machine, with CPU + memory + NIC nodes joined by an interconnection network. Conceptual view only: real shared memory machines have physically distributed memory.]
#1: IBM Blue Gene/L
† Prototype System with Only 32768 CPUs
† Final System Will Have 4 Times That
† Each CPU Runs at 700 MHz
† Intended for Protein Folding and Massively Parallel Simulations
† Achieved 70.72 TFLOPS
† Networks:
„ 3D Toroidal Mesh (350 MB/s x 6 links per node)
„ Gigabit Ethernet for storage
„ Combining Tree for Global Operations (Reduce, etc.; see the sketch below)
„ Barrier/interrupt network
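
As an illustration (not from the slides), the kind of global operation the combining tree accelerates is a reduction across all nodes, sketched here with MPI:

/* Global sum: the combining-tree network can perform this reduction
   without routing every partial result through general-purpose links. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (double)rank;   /* placeholder for a per-node partial result */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}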
Blue Gene/L Continued

From Top500.org Website
#2: SGI Altix (NASA Columbia)
† 10240 Itanium 2 Processors Grouped in Clusters of 512
„ 1.5 GHz, 6 MB Cache
„ Shared memory within each 512-CPU cluster
„ 20 TB Total Memory
† Runs Linux
† Networks
„ SGI NUMAlink (6.4 GB/s)
„ InfiniBand (10 Gb/s, 4 µs latency)
„ 10 Gigabit Ethernet
„ 1 Gigabit Ethernet
† 51.87 TFLOPS

Columbia Photo

From NASA Ames Research Center Website
#3: Earth Simulator
† Was #1 for 3 Years Until Nov. 2004
† 5120 Processors
„ 640 Nodes with 8 Processors Each
„ 16 GB RAM per Node
„ NEC SX-6 Vector Processors
† Full Crossbar Interconnect
„ Bidirectional 12.3 GB/s per node
„ 8 TB/s Total
† 35.86 TFLOPS
Earth Simulator Pictures

[Photos: processing node and interconnect node]

Pictures from http://www.es.jamstec.go.jp/esc/eng/ES/hardware.html
Beowulf Clusters
† Started as networks of low-cost PCs
† Now, thousands of CPUs
„ Many single proc
„ Some dual proc or more
† Interconnection network key to performance (see the sketch below)
„ Myrinet: 2 Gb/s, 10 µs
„ InfiniBand: 10 Gb/s, 5 µs
„ Quadrics: 9 Gb/s, 4 µs
„ GigE: 1 Gb/s, 40 µs
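
A simple way to see why the interconnect matters is the usual first-order cost model T(n) ≈ latency + n / bandwidth; the sketch below (an assumption, not from the talk) applies it to the figures above for a 64 KB message:

/* Estimate transfer time for a 64 KB message on each interconnect. */
#include <stdio.h>

int main(void)
{
    const char *net[] = { "Myrinet", "InfiniBand", "Quadrics", "GigE" };
    double bw_gbps[]  = { 2.0, 10.0, 9.0, 1.0 };   /* bandwidth, Gb/s */
    double lat_us[]   = { 10.0, 5.0, 4.0, 40.0 };  /* latency, microseconds */
    double bytes      = 64.0 * 1024.0;

    for (int i = 0; i < 4; i++) {
        /* 1 Gb/s = 1000 bits per microsecond */
        double t_us = lat_us[i] + (bytes * 8.0) / (bw_gbps[i] * 1000.0);
        printf("%-10s  %6.1f us\n", net[i], t_us);
    }
    return 0;
}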

Top Clusters
Name/Org                 CPUs             Interconnect   Rpeak (GFLOPS)   Rmax (GFLOPS)
Barcelona MareNostrum*   2563 PPC970      Myrinet        31363            20530
LLNL Thunder             4096 Itanium 2   Quadrics       22938            19940
LANL ASCI Q              8192 Alpha       Quadrics       20480            13880
VA Tech System X         2200 PPC970      InfiniBand     20240            12250
* See SlashDot article today on building of MareNostrum
From Top500.org Website
Top Machines Summary
[Bar chart: Actual vs. Peak GFLOPS (y-axis 0 to 100000) for Blue Gene/L, Columbia, Earth Simulator, Thunder, ASCI Q, and System X]
Cray X1 (Vector)
† Distributed Shared Memory Vector Multiprocessor
„ 4 CPUs per node
„ 800 MHz, 16 ops/cycle
„ 16 nodes/cabinet
† 819 GFLOPS per cabinet
† 512 GB RAM per cabinet
„ Up to 64 cabinets
† Modified 2D Torus Interconnect
http://www.cray.com/products/x1/specifications.html

Cray XD1 (Supercluster)
† Each Chassis
„ 12 Opterons
„ 2-way SMPs
„ 58 (Peak) GFLOPS
„ Virtex-II Pro FPGAs
„ RapidArray Interconnect
† Each Rack
„ 12 Chassis
„ RapidArray Interconnect
„ MPI Latency: 2.0 µsec

http://www.cray.com/downloads/Cray_XD1_Datasheet.pdf

IBM Power Series
† 8 to 32 POWER4 or POWER5 CPUs
„ Multi-chip packages
„ Simultaneous Multithreading
† Multi-Gbps Interconnect Between Components
† Pictured: UCI’s Earth System Modeling Facility (88 CPUs)
„ 7 nodes x 8 CPUs
„ 1 node x 32 CPUs

Trends
† What Are the Trends, Based on Current Machines?
† Commodity Processors
† Vector Machines Still Around
† Processors Moved Closer to Each Other
„ Nodes Composed of SMPs
„ From 2 to 512 CPUs share memory
† Interconnection Networks Getting Faster
„ But Not as Quickly as CPU Speed
† Machines Hot and Power Hungry
„ Exception: Blue Gene/L (1.2 MW)
Research Topics
† Programming Models
† Grid Computing
„ Combining resources/utility computing
† OptIPuter
„ High-Performance Computing, Storage, and Visualization Resources Connected by Fiber
„ WDM allows dedicated lambdas per app.
„ UCSD (Larry Smarr, PI), UIC, USC, UCI

Shared Memory Programming Model
† Shared Memory Programming Looks Easy
„ Threads: POSIX, OpenMP, etc.
„ Implicit Parallelism (OpenMP)
#pragma omp parallel for private(i, k)
for (i = 0; i < nx; i++)
    for (k = 0; k < nz; k++) {   /* front and back plates */
        ez[i][0][k] = 0.0; ez[i][1][k] = 0.0; …

† But Shared Resources Make Things Ugly
„ Shared Data => Locks (see the sketch below)
„ Memory Allocation => Hidden locks kill performance
„ Contention for Memory Regions
† So Many Shared Memory Machines Are Programmed as if They Were Distributed
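
A minimal sketch of the lock problem (illustrative, not the author's code): updating a shared accumulator forces a critical section on every iteration, while the idiomatic OpenMP reduction keeps the locking out of the loop:

/* Shared data => locks: correct but slow vs. the reduction idiom. */
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0, sum2 = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical      /* lock taken on every update */
        sum += 1.0 / (i + 1);
    }

    #pragma omp parallel for reduction(+:sum2)   /* per-thread partials, one combine */
    for (int i = 0; i < n; i++)
        sum2 += 1.0 / (i + 1);

    printf("critical: %f, reduction: %f\n", sum, sum2);
    return 0;
}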
Message Passing Programming Model
† Message Passing Interface (MPI)
„ High Performance, Relatively Simple
„ All Parallelism Managed by User
„ Explicit Send/Receive Operations
MPI_Isend(&AR_INDEX(ex, 0, lowy, 0) /* lowest plane on node */, 1 /* count */,
          XZ_PlaneType, neighbor_nodes[Y_DIMENSION][LOW_NEIGHBOR], TAG_EXXZ,
          MPI_COMM_WORLD, &requestArray[count++]);

MPI_Irecv(&AR_INDEX(ex, 0, 0, highz + 1) /* one past highz point */,
          1 /* count */, XY_PlaneType,
          neighbor_nodes[Z_DIMENSION][HIGH_NEIGHBOR] /* source */,
          TAG_EXXY, MPI_COMM_WORLD, &requestArray[count++]);

Debugging
† Parallel debugging is mostly awful
„ 10s or 100s of program states
„ GDB for Threads is bad enough!
† Need way to capture and visualize program state
„ Zero in on trouble spots
„ Deadlocks common

Future Architecture Research
† IBM/Toshiba/Sony Cell Architecture
„ General Purpose CPU With SMT
„ SIMD Units with Fast RAM
„ Said to be comparable to an Earth Simulator node
† Stream Processors (& Media Processors)
† Quantum Computing
† Fault Tolerance
† Power Consumption Awareness
Conclusion
† Despite Our Home Computers Being Faster than Early Supercomputers
„ Many supercomputers are still being built
„ Different architectures still abound
† Problem sizes getting larger
„ Finer meshes
„ More time steps
„ More precise calculations

