
Supercomputer Architecture: The TeraFLOPS Race

Stephen Jenks
Scalable Parallel & Dist. Systems Lab
EECS Colloquium Feb. 16, 2005

Why Supercomputing?
† Some Problems Larger Than a Single Computer Can Process
„ Memory Space (>> 4-8 GB)
„ Computation Cost (O(n^3), for example)
„ More Iterations (100 years)
„ Data Sources (Sensor processing)
† National Pride
† Technology Migrates to Consumers

Supercomputer Applications
† Weather Prediction
† Pollution Flow
† Fluid Dynamics
† Stress Analysis
† Protein Folding
† Chemistry Simulation
† Nuclear Simulation
† Equation Solving
† Code Breaking

How Fast Are Supercomputers?
† The Top Machines Can Perform Tens of Trillions of Floating-Point Operations per Second (TeraFLOPS)
† They Can Store Trillions of Data Items in RAM!
† Example: 1 km grid over the USA
„ 4000 x 2000 x 100 = 800 million grid points
„ If each point has 10 values, and each value takes 10 ops to compute => 80 billion ops per iteration
„ If we want 1-hour timesteps for 10 years, that's 87,600 iterations
„ More than 7 peta-ops total! (see the quick check below)
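
A quick back-of-the-envelope check of these numbers, as a minimal C sketch (illustrative only, not part of the original talk):

/* Sanity-check the slide's arithmetic for the 1 km grid example. */
#include <stdio.h>

int main(void)
{
    double points       = 4000.0 * 2000.0 * 100.0;  /* 800 million grid points */
    double ops_per_iter = points * 10.0 * 10.0;     /* 10 values x 10 ops each */
    double iterations   = 10.0 * 365.0 * 24.0;      /* hourly steps for 10 years */
    double total_ops    = ops_per_iter * iterations;

    printf("ops/iteration = %.3g, iterations = %.0f, total = %.3g ops\n",
           ops_per_iter, iterations, total_ops);    /* total is about 7e15 */
    return 0;
}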

How Fast is That?
† Cray-1 (1977)
„ 250 MFLOPS
„ 80 MHz
„ 1 MWord (64-bit)
† Original PC CPU: Intel 8088 (1979)
„ 5 MHz
„ 1 MB address space
† Modern PC (Pentium 4)
„ 3 GHz
„ 6 GFLOPS
„ 4 GB RAM

http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html
Lies, Damn Lies, and Statistics
† Manufacturers Claim Ideal Performance
„ 2 FP Units @ 3 GHz => 6 GFLOPS (see the sketch below)
„ Dependences mean we won't get that much!
† How Do We Know Real Performance?
„ Top500.org Uses High-Performance LINPACK (HPL)
„ http://www.netlib.org/benchmark/hpl
„ Solves a Dense Set of Linear Equations
„ Lots of Communication and Parallelism
„ Not Necessarily Reflective of Target Apps
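
As a rough illustration of the peak-versus-real gap (not from the talk; the "measured" value below is a hypothetical placeholder), peak performance is just functional units times clock rate, and Top500 efficiency is Rmax divided by Rpeak:

/* Peak FLOPS and LINPACK efficiency, using the slide's Pentium 4 figures. */
#include <stdio.h>

int main(void)
{
    double clock_ghz    = 3.0;              /* clock rate in GHz */
    double fp_per_cycle = 2.0;              /* FP results per cycle (claimed) */
    double rpeak_gflops = clock_ghz * fp_per_cycle;   /* 6 GFLOPS peak */

    double rmax_gflops  = 3.5;              /* hypothetical measured HPL result */
    printf("Rpeak = %.1f GFLOPS, efficiency = %.0f%%\n",
           rpeak_gflops, 100.0 * rmax_gflops / rpeak_gflops);
    return 0;
}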

Who Makes Supercomputers?

Supercomputer Architectures
† All Have Some Parallelism; Most Have Several Types
„ Pipelining (Overlapping Execution of Several Instructions)
„ Shared Address Space Parallelism
„ Distributed Memory (Multicomputer)
„ Vector or SIMD
† Almost All Use the Single-Program, Multiple-Data (SPMD) Model (see the sketch below)
„ Same Program Runs on All CPUs
„ Unique Identifier Per Copy
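
A minimal SPMD sketch, assuming MPI as the programming interface (the message passing examples later in the talk use MPI): every CPU runs the same program and uses its rank as the unique identifier.

/* SPMD skeleton: one program, many copies, each identified by its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* unique ID of this copy */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of copies */

    /* Each copy picks its share of the work based on its rank. */
    printf("copy %d of %d starts at row %d\n", rank, nprocs, rank * 100);

    MPI_Finalize();
    return 0;
}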

Architecture Diagrams
[Diagrams: a shared address space machine, with CPUs connected to a single shared memory, and a distributed memory machine, with CPU + memory + NIC nodes joined by an interconnection network. Conceptual view only: real shared memory machines have physically distributed memory.]
#1: IBM Blue Gene/L
† Prototype System with Only 32768 CPUs
† Final System Will Have 4 Times That
† Each CPU Runs at 700 MHz
† Intended for Protein Folding and Massively Parallel Simulations
† Achieved 70.72 TFLOPS
† Networks:
„ 3D Toroidal Mesh (350 MB/s x 6 links per node)
„ Gigabit Ethernet for storage
„ Combining Tree for Global Operations (Reduce, etc.; see the sketch below)
„ Barrier/interrupt network
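
As an illustration (not from the slides), the kind of global operation the combining tree accelerates is a reduction across all nodes, sketched here with MPI:

/* Global sum: the combining-tree network can perform this reduction
   without routing every partial result through general-purpose links. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (double)rank;   /* placeholder for a per-node partial result */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}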
Blue Gene/L Continued

From Top500.org Website
#2: SGI Altix (NASA Columbia)
† 10240 Itanium 2 Processors Grouped in Clusters of 512
„ 1.5 GHz, 6 MB Cache
„ Shared memory within each 512-CPU cluster
„ 20 TB Total Memory
† Runs Linux
† Networks
„ SGI NUMAlink (6.4 GB/s)
„ InfiniBand (10 Gb/s, 4 µs latency)
„ 10 Gigabit Ethernet
„ 1 Gigabit Ethernet
† 51.87 TFLOPS

Columbia Photo

From NASA Ames Research Center Website
#3: Earth Simulator
† Was #1 for 3 Years Until Nov. 2004
† 5120 Processors
„ 640 Nodes with 8 Processors Each
„ 16 GB RAM per Node
„ NEC SX-6 Vector Processors
† Full Crossbar Interconnect
„ Bidirectional 12.3 GB/s per node
„ 8 TB/s Total
† 35.86 TFLOPS
Earth Simulator Pictures

[Photos: processing node and interconnect node]

Pictures from http://www.es.jamstec.go.jp/esc/eng/ES/hardware.html
Beowulf Clusters
† Started as networks of low-cost PCs
† Now, thousands of CPUs
„ Many single proc
„ Some dual proc or more
† Interconnection network key to performance (see the sketch below)
„ Myrinet: 2 Gb/s, 10 µs
„ InfiniBand: 10 Gb/s, 5 µs
„ Quadrics: 9 Gb/s, 4 µs
„ GigE: 1 Gb/s, 40 µs
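
A simple way to see why the interconnect matters is the usual first-order cost model T(n) ≈ latency + n / bandwidth; the sketch below (an assumption, not from the talk) applies it to the figures above for a 64 KB message:

/* Estimate transfer time for a 64 KB message on each interconnect. */
#include <stdio.h>

int main(void)
{
    const char *net[] = { "Myrinet", "InfiniBand", "Quadrics", "GigE" };
    double bw_gbps[]  = { 2.0, 10.0, 9.0, 1.0 };   /* bandwidth, Gb/s */
    double lat_us[]   = { 10.0, 5.0, 4.0, 40.0 };  /* latency, microseconds */
    double bytes      = 64.0 * 1024.0;

    for (int i = 0; i < 4; i++) {
        /* 1 Gb/s = 1000 bits per microsecond */
        double t_us = lat_us[i] + (bytes * 8.0) / (bw_gbps[i] * 1000.0);
        printf("%-10s  %6.1f us\n", net[i], t_us);
    }
    return 0;
}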

Top Clusters
Name/Org                 CPUs             Interconnect   Rpeak (GFLOPS)   Rmax (GFLOPS)
Barcelona MareNostrum*   2563 PPC970      Myrinet        31363            20530
LLNL Thunder             4096 Itanium 2   Quadrics       22938            19940
LANL ASCI Q              8192 Alpha       Quadrics       20480            13880
VA Tech System X         2200 PPC970      InfiniBand     20240            12250
* See SlashDot article today on building of MareNostrum
From Top500.org Website
Top Machines Summary
[Bar chart: Actual vs. Peak GFLOPS (y-axis 0 to 100000) for Blue Gene/L, Columbia, Earth Simulator, Thunder, ASCI Q, and System X]
Cray X1 (Vector)
† Distributed Shared Memory Vector Multiprocessor
„ 4 CPUs per node
„ 800 MHz, 16 ops/cycle
„ 16 nodes/cabinet
† 819 GFLOPS per cabinet
† 512 GB RAM per cabinet
„ Up to 64 cabinets
† Modified 2D Torus Interconnect
http://www.cray.com/products/x1/specifications.html

Cray XD1 (Supercluster)
† Each Chassis
„ 12 Opterons
„ 2-way SMPs
„ 58 (Peak) GFLOPS
„ Virtex-II Pro FPGAs
„ RapidArray Interconnect
† Each Rack
„ 12 Chassis
„ RapidArray Interconnect
„ MPI Latency: 2.0 µsec

http://www.cray.com/downloads/Cray_XD1_Datasheet.pdf

IBM Power Series
† 8 to 32 POWER4 or POWER5 CPUs
„ Multi-chip packages
„ Simultaneous Multithreading
† Multi-Gbps Interconnect Between Components
† Pictured: UCI’s Earth System Modeling Facility (88 CPUs)
„ 7 nodes x 8 CPUs
„ 1 node x 32 CPUs

Trends
† What Are the Trends, Based on Current Machines?
† Commodity Processors
† Vector Machines Still Around
† Processors Moved Closer to Each Other
„ Nodes Composed of SMPs
„ From 2 to 512 CPUs share memory
† Interconnection Networks Getting Faster
„ But Not as Quickly as CPU Speed
† Machines Hot and Power Hungry
„ Exception: Blue Gene/L (1.2 MW)
Research Topics
† Programming Models
† Grid Computing
„ Combining resources/utility computing
† OptIPuter
„ High-Performance Computing, Storage, and Visualization Resources Connected by Fiber
„ WDM allows dedicated lambdas per app.
„ UCSD (Larry Smarr, PI), UIC, USC, UCI

Shared Memory Programming Model
† Shared Memory Programming Looks Easy
„ Threads: POSIX, OpenMP, etc.
„ Implicit Parallelism (OpenMP)
#pragma omp parallel for private(i, k)
for (i = 0; i < nx; i++)
    for (k = 0; k < nz; k++) {   /* front and back plates */
        ez[i][0][k] = 0.0; ez[i][1][k] = 0.0; …

† But Shared Resources Make Things Ugly
„ Shared Data => Locks (see the sketch below)
„ Memory Allocation => Hidden locks kill performance
„ Contention for Memory Regions
† So Many Shared Memory Machines Are Programmed as if They Were Distributed
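
A minimal sketch of the lock problem (illustrative, not the author's code): updating a shared accumulator forces a critical section on every iteration, while the idiomatic OpenMP reduction keeps the locking out of the loop:

/* Shared data => locks: correct but slow vs. the reduction idiom. */
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0, sum2 = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical      /* lock taken on every update */
        sum += 1.0 / (i + 1);
    }

    #pragma omp parallel for reduction(+:sum2)   /* per-thread partials, one combine */
    for (int i = 0; i < n; i++)
        sum2 += 1.0 / (i + 1);

    printf("critical: %f, reduction: %f\n", sum, sum2);
    return 0;
}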
Message Passing Programming Model
† Message Passing Interface (MPI)
„ High Performance, Relatively Simple
„ All Parallelism Managed by User
„ Explicit Send/Receive Operations
MPI_Isend(&AR_INDEX(ex, 0, lowy, 0) /* lowest plane on node */, 1 /* count */,
          XZ_PlaneType, neighbor_nodes[Y_DIMENSION][LOW_NEIGHBOR], TAG_EXXZ,
          MPI_COMM_WORLD, &requestArray[count++]);

MPI_Irecv(&AR_INDEX(ex, 0, 0, highz + 1) /* one past highz point */,
          1 /* count */, XY_PlaneType,
          neighbor_nodes[Z_DIMENSION][HIGH_NEIGHBOR] /* source */,
          TAG_EXXY, MPI_COMM_WORLD, &requestArray[count++]);

Debugging
† Parallel debugging is mostly awful
„ 10s or 100s of program states
„ GDB for Threads is bad enough!
† Need way to capture and visualize program state
„ Zero in on trouble spots
„ Deadlocks common

Future Architecture Research
† IBM/Toshiba/Sony Cell Architecture
„ General Purpose CPU With SMT
„ SIMD Units with Fast RAM
„ Said to be comparable to an Earth Simulator node
† Stream Processors (& Media Processors)
† Quantum Computing
† Fault Tolerance
† Power Consumption Awareness
Conclusion
† Despite Our Home Computers Being Faster than Early Supercomputers
„ Many supercomputers are still being built
„ Different architectures still abound
† Problem sizes getting larger
„ Finer meshes
„ More time steps
„ More precise calculations

