
SC12 11/14/12

Titan: World's #1 Open Science Supercomputer

18,688 Tesla K20X GPUs (Kepler GK110)
27 petaflops peak; 90% of performance from the GPUs
17.59 petaflops sustained on HPL (Linpack)
2.12 GFLOPS/W for the system; the GK110 itself is 7 GFLOPS/W

The Road to Exascale

               2012 (you are here)    2020
Performance    20 PF                  1000 PF (50x)
Nodes          18,000 GPUs            72,000 HCNs (4x)
Power          10 MW                  20 MW (2x)
Efficiency     2 GFLOPS/W             50 GFLOPS/W (25x)
Threads        ~10^7                  ~10^9 (100x)

Technical Challenges on The Road to Exascale

1. Energy Efficiency
2. Parallel Programmability
3. Resilience

Energy Efficiency


Moore's Law to the Rescue?

Moore, Electronics 38(8), April 19, 1965

Unfortunately Not!

C. Moore, "Data Processing in ExaScale-Class Computer Systems," Salishan, April 2011

Chips are now power, not area limited

[Figure: P = 150 W chip vs. P = 5 W chip]

Perf (ops/s) = P (W) × Eff (ops/J)

Process is improving Eff by only 15-25% per node, which is about 2-3x in 8 years.
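As a rough check on that compounding claim (my arithmetic, assuming about four process nodes in those eight years, the same count the waterfall slides below use):

\[ 1.15^{4} \approx 1.7\times \qquad\qquad 1.25^{4} \approx 2.4\times \]

so per-node gains of 15-25% do land in the quoted 2-3x range.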

We need 25x energy efficiency.
2-3x will come from process.
10x must come from architecture and circuits.

Energy Efficiency

2 GFLOPS/W today
  + 4 process nodes (3x)    -> 6 GFLOPS/W
  + integrate host (2x)     -> 12 GFLOPS/W
  goal: 50 GFLOPS/W (25x)

The High Cost of Data Movement

Fetching operands costs more than computing on them (28 nm, 20 mm die):

64-bit DP operation              20 pJ
256-bit access to an 8 kB SRAM   50 pJ
256-bit on-chip bus              26 pJ for a short hop, 256 pJ farther, 1 nJ across the 20 mm chip
Efficient off-chip link          500 pJ
DRAM Rd/Wr                       16 nJ
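One hedged way to read these numbers (my arithmetic, scaling the 256-bit accesses down to a single 64-bit operand):

\[ \frac{16\,\mathrm{nJ}/4}{20\,\mathrm{pJ}} = 200\times \qquad\qquad \frac{1\,\mathrm{nJ}/4}{20\,\mathrm{pJ}} \approx 12\times \]

Fetching one operand from DRAM costs on the order of 200x the double-precision operation that consumes it, and even a full-chip on-chip hop costs more than ten operations. This is where the >100:1 global-to-local energy ratio cited later in the deck comes from.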

Low-Energy Signaling


Energy cost with efficient signaling (28 nm, 20 mm die)

64-bit DP operation              20 pJ
256-bit access to an 8 kB SRAM   50 pJ
256-bit on-chip bus              3 pJ for a short hop, 30 pJ farther, 100 pJ across the 20 mm chip
Efficient off-chip link          100 pJ
DRAM Rd/Wr                       1 nJ

Energy Efficiency

2 GFLOPS/W today
  + 4 process nodes (3x)       -> 6 GFLOPS/W
  + integrate host (2x)        -> 12 GFLOPS/W
  + advanced signaling (2x)    -> 24 GFLOPS/W
  goal: 50 GFLOPS/W (25x)

An Out-of-Order Core

Spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add): roughly 80x the energy of the multiply itself, and 4000x that of the add.

SM Lane Architecture

Control path: thread PCs, active PCs, scheduler, and an L0 instruction cache.
Data path: a register file (RF) and per-unit operand register files (ORFs) feeding two FP/Int units and an LS/BR unit, with L0/L1 address generation, a network port, and local memory (LM) banks 0-3 serving loads/stores.

64 threads, 4 active threads
2 DFMAs (4 FLOPS/clock)
ORF bank: 16 entries (128 bytes)
L0 I$: 64 instructions (1 KB)
LM bank: 8 KB (32 KB total)

Energy Efficiency

2 GFLOPS/W today
  + 4 process nodes (3x)              -> 6 GFLOPS/W
  + integrate host (2x)               -> 12 GFLOPS/W
  + advanced signaling (2x)           -> 24 GFLOPS/W
  + efficient microarchitecture (2x)  -> 48 GFLOPS/W
  goal: 50 GFLOPS/W (25x)

Energy Efficiency

2 GFLOPS/W today
  + 4 process nodes (3x)                                            -> 6 GFLOPS/W
  + integrate host (2x)                                             -> 12 GFLOPS/W
  + advanced signaling (2x)                                         -> 24 GFLOPS/W
  + efficient microarchitecture (2x)                                -> 48 GFLOPS/W
  + optimized voltages, efficient memory, locality enhancement (??x) -> 50 GFLOPS/W (25x)
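Multiplying out the labelled steps (my arithmetic; the slide gives only the per-step factors):

\[ 2 \times 3 \times 2 \times 2 \times 2 = 48\ \mathrm{GFLOPS/W} \]

which is within about 4% of the 50 GFLOPS/W (25x) goal, so the "??x" contribution from optimized voltages, efficient memory, and locality enhancement has to cover the last few percent.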

Parallel Programmability


Parallel programming is not inherently any more difficult than serial programming.
However, we can make it a lot more difficult.

A simple parallel program

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested forall
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

Why is this easy?

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested forall
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

No machine details
All parallelism is expressed
Synchronization is semantic (in the reduction)
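For concreteness, here is one way such a forall nest might be lowered onto a current GPU: a CUDA kernel with one thread per molecule, the inner foralls run sequentially inside each thread, and the reduction done as a private accumulation. This is only a hedged sketch of one possible mapping, not the mapping the talk prescribes; the Molecule layout, the fixed-degree neighbor list, forces_kernel, and pair_force() are assumptions for illustration, and the forall over force kinds is folded into pair_force() for brevity.

#include <cuda_runtime.h>

// Hypothetical data layout, purely for illustration.
struct Molecule {
    float3 pos;
    float3 force;
};

// Placeholder pair-wise interaction; a real code would evaluate the actual
// force kinds here (the "forall force in forces" level of the nest).
__device__ float3 pair_force(const Molecule& a, const Molecule& b) {
    return make_float3(b.pos.x - a.pos.x, b.pos.y - a.pos.y, b.pos.z - a.pos.z);
}

__global__ void forces_kernel(Molecule* set, const int* neighbors,
                              int num_molecules, int max_neighbors) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // "forall molecule in set"
    if (i >= num_molecules) return;

    float3 sum = make_float3(0.f, 0.f, 0.f);         // reduce_sum, accumulated privately
    for (int n = 0; n < max_neighbors; ++n) {        // "forall neighbor in molecule.neighbors"
        int j = neighbors[i * max_neighbors + n];
        float3 f = pair_force(set[i], set[j]);
        sum.x += f.x; sum.y += f.y; sum.z += f.z;
    }
    set[i].force = sum;
}

// Launch with one thread per molecule, e.g.:
// forces_kernel<<<(num_molecules + 255) / 256, 256>>>(d_set, d_neighbors,
//                                                     num_molecules, max_neighbors);

Note how much of this code is mapping rather than algorithm: the thread indexing, the neighbor-list layout, the launch geometry. That is exactly the work the following slides argue should move into tools.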

We could make it hard

pid = fork();                  // explicitly managing threads

lock(struct.lock);             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);

code = send(pid, tag, &msg);   // partition across nodes

Programmers, tools, and architecture need to play their positions

Programmer: algorithm, all of the parallelism, abstract locality
Tools: combinatorial optimization, mapping, selection of mechanisms
Architecture: fast mechanisms, exposed costs

Programmers, tools, and architecture need to play their positions

Programmer:
forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested forall
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools: map foralls in time and space, map molecules across memories, stage data up/down the hierarchy, select mechanisms (a sketch of such a mapping follows below)
Architecture: exposed storage hierarchy; fast comm/sync/thread mechanisms
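To make the tools row concrete, below is a hedged sketch of the kind of code a mapper might emit for the same forall nest once it has picked a placement: a thread block cooperatively stages a tile of molecules into shared memory (one exposed level of the storage hierarchy) and computes against it. The TILE size, the all-pairs simplification, forces_tiled, and pair_force() are illustrative assumptions, not the output of any actual tool.

#include <cuda_runtime.h>

// Illustrative tile size; a real mapper would choose this per machine.
// Launch with blockDim.x == TILE.
#define TILE 128

struct Molecule { float3 pos; float3 force; };

// Placeholder pair-wise interaction, as in the earlier sketch.
__device__ float3 pair_force(const Molecule& a, const Molecule& b) {
    return make_float3(b.pos.x - a.pos.x, b.pos.y - a.pos.y, b.pos.z - a.pos.z);
}

__global__ void forces_tiled(Molecule* set, int num_molecules) {
    __shared__ Molecule tile[TILE];                  // staged copy, one hierarchy level up

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's molecule
    Molecule mine = (i < num_molecules) ? set[i] : Molecule{};
    float3 sum = make_float3(0.f, 0.f, 0.f);

    // Sweep the set tile by tile (all-pairs for simplicity; a neighbor-list
    // version would stage only the listed neighbors).
    for (int base = 0; base < num_molecules; base += TILE) {
        int j = base + threadIdx.x;
        if (j < num_molecules) tile[threadIdx.x] = set[j];   // stage data up
        __syncthreads();

        int limit = min(TILE, num_molecules - base);
        for (int t = 0; t < limit; ++t) {            // compute against the staged tile
            float3 f = pair_force(mine, tile[t]);
            sum.x += f.x; sum.y += f.y; sum.z += f.z;
        }
        __syncthreads();
    }
    if (i < num_molecules) set[i].force = sum;       // write the result back down
}

The algorithm is unchanged from the programmer's forall nest; only the mapping (tiling, staging, launch shape) differs, which is the division of labor the slide argues for.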

Fundamental and Incidental Obstacles to Programmability

Fundamental:
Expressing 10^9-way parallelism
Expressing locality to deal with a >100:1 global:local energy ratio
Balancing load across 10^9 cores

Incidental:
Dealing with multiple address spaces
Partitioning data across nodes
Aggregating data to amortize message overhead

Parallel Programmability

From ~10^7 threads today to ~10^9 threads:
Abstract parallelism and locality
Autotuning mapper
Exposed storage hierarchy
Fast communication, synchronization, and thread management

NVIDIA Exascale Architecture


System Sketch


Echelon Chip Floorplan

10 nm process, 290 mm² die, roughly 17 mm on a side.
A tiled array of SMs (each built from multiple lanes) connected by a network-on-chip (NOC), together with a small number of latency-optimized cores (LOCs), L2 banks and a crossbar (XBAR), and DRAM I/O and network (NW) I/O blocks around the perimeter.

The fundamental problems are hard enough. We must eliminate the incidental ones.


Parallel Roads to Exascale

Energy efficiency road: 2 GFLOPS/W -> 4 process nodes (3x) -> integrate host (2x) -> advanced signaling (2x) -> efficient microarchitecture (2x) -> optimized voltages, efficient memory, locality enhancement (??x) -> 50 GFLOPS/W (25x)

Programmability road: abstract parallelism and locality -> autotuning mapper -> exposed storage hierarchy -> fast communication, synchronization, and thread management -> 10^9 threads
