18,688 Tesla K20X GPUs
27 Petaflops Peak: 90% of Performance from GPUs
17.59 Petaflops Sustained Performance on Linpack
SC12 11/14/12
Titan
18,688 Kepler GK110
27 PF peak (90% from GPUs)
17.6 PF HP Linpack
2.12 GF/W
GK110 is 7 GF/W
Energy Efficiency
Unfortunately Not!
P=150W
P=5W
Perf (Ops/s) = P (W) * Eff (Ops/J)
Process is improving Eff by 15-25% per node
2-3x in 8 years
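The relation Perf = P * Eff can be checked against the Titan figures quoted in this deck. The sketch below rearranges it to estimate the power implied by 17.6 PF at 2.12 GF/W, then compounds the 15-25% per-node gain. The function name and the choice of five process nodes in 8 years are assumptions for illustration, not figures from the slides.

```python
def power_watts(perf_ops_per_s: float, eff_ops_per_joule: float) -> float:
    """Rearrange Perf = P * Eff into P = Perf / Eff."""
    return perf_ops_per_s / eff_ops_per_joule

# Titan figures quoted in this deck.
TITAN_LINPACK_OPS = 17.6e15   # 17.6 PF on HP Linpack
TITAN_EFF = 2.12e9            # 2.12 GF/W, i.e. 2.12e9 ops per joule

titan_power = power_watts(TITAN_LINPACK_OPS, TITAN_EFF)
print(f"Implied Titan power: {titan_power / 1e6:.1f} MW")

# ASSUMPTION (not from the slides): five process nodes in 8 years.
NODES = 5
low_gain = 1.15 ** NODES    # 15% efficiency gain per node, compounded
high_gain = 1.25 ** NODES   # 25% efficiency gain per node, compounded
print(f"Process gain over {NODES} nodes: {low_gain:.1f}x to {high_gain:.1f}x")
```

Under these assumptions the implied power is about 8.3 MW, and five nodes of compounding land close to the 2-3x range stated on the slide.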
We need 25x energy efficiency
2-3x will come from process
10x must come from Architecture and Circuits
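The "10x" figure follows from dividing the 25x target by the 2-3x expected from process. A minimal sketch of that arithmetic, with illustrative names that are not from the slides:

```python
TARGET_GAIN = 25.0          # "We need 25x energy efficiency"
PROCESS_GAIN = (2.0, 3.0)   # "2-3x will come from process"

def remaining_gain(target: float, process: float) -> float:
    """Gain still required from architecture and circuits."""
    return target / process

needed_max = remaining_gain(TARGET_GAIN, PROCESS_GAIN[0])  # process delivers only 2x
needed_min = remaining_gain(TARGET_GAIN, PROCESS_GAIN[1])  # process delivers 3x
print(f"Architecture and circuits must supply {needed_min:.1f}x to {needed_max:.1f}x")
```

The range works out to roughly 8.3x to 12.5x, which the slide rounds to 10x.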
Energy Efficiency
50GFLOPs/W (25x)
2GFLOPs/W
[Figure, 28nm: energy costs of 26 pJ, 256 pJ, 16 nJ, 500 pJ, 50 pJ, and 1 nJ]
Low-Energy Signaling
[Figure, 28nm: energy costs of 3 pJ, 30 pJ, 1 nJ, 100 pJ, 50 pJ, and 100 pJ; 256-bit buses; 256-bit access to 8 kB SRAM]
Energy Efficiency
Advanced Signaling 24GFLOPs/W (2x)
50GFLOPs/W (25x)
2GFLOPs/W
An Out-of-Order Core
Spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add)
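To put the scheduling cost in proportion, the sketch below computes the overhead ratio and the share of energy spent on the operation itself. It assumes the 2 nJ is paid in addition to the operation's own energy; the slide does not say so explicitly, and the function names are illustrative.

```python
SCHEDULE_PJ = 2000.0   # 2 nJ scheduling energy, in picojoules
OP_ENERGY_PJ = {"FMUL": 25.0, "integer add": 0.5}

def overhead_ratio(schedule_pj: float, op_pj: float) -> float:
    """How many times more energy goes to scheduling than to the operation."""
    return schedule_pj / op_pj

def useful_fraction(schedule_pj: float, op_pj: float) -> float:
    """Share of total energy spent on the operation itself.

    ASSUMPTION: total energy = scheduling energy + operation energy.
    """
    return op_pj / (schedule_pj + op_pj)

ratios = {name: overhead_ratio(SCHEDULE_PJ, pj) for name, pj in OP_ENERGY_PJ.items()}
fractions = {name: useful_fraction(SCHEDULE_PJ, pj) for name, pj in OP_ENERGY_PJ.items()}

for name in OP_ENERGY_PJ:
    print(f"{name}: overhead {ratios[name]:.0f}x, useful energy {fractions[name]:.3%}")
```

Scheduling costs 80 times the FMUL and 4000 times the integer add, so under this assumption barely more than 1% of the energy does useful work even in the floating-point case.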
4/11/11
Milad Mohammadi
SM Lane Architecture
[Figure: SM lane block diagram with a Control Path and a Data Path. Labeled blocks: Thread PCs, Active PCs, Scheduler, L0 I$, Inst, ORF (three instances), LM Bank 0 and LM Bank 3 (each to LD/ST), Net]
64 threads
4 active threads
2 DFMAs (4 FLOPS/clock)
ORF bank: 16 entries (128 Bytes)
L0 I$: 64 instructions (1 KByte)
LM Bank: 8 KB (32 KB total)
Functional units: FP/Int, FP/Int, LS/BR
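The lane parameters listed above imply a few derived quantities (entry width, instruction size, bank count). The sketch below works them out as a consistency check. The dictionary keys are illustrative names, and counting one DFMA as two floating-point operations is the usual convention rather than something the slide states.

```python
lane = {
    "threads": 64,
    "active_threads": 4,
    "dfma_units": 2,
    "orf_entries": 16,
    "orf_bytes": 128,
    "l0_icache_instructions": 64,
    "l0_icache_bytes": 1024,        # 1 KByte
    "lm_bank_bytes": 8 * 1024,      # 8 KB per bank
    "lm_total_bytes": 32 * 1024,    # 32 KB total
}

# One fused multiply-add counts as two floating-point operations.
flops_per_clock = lane["dfma_units"] * 2

orf_entry_bytes = lane["orf_bytes"] // lane["orf_entries"]
instruction_bytes = lane["l0_icache_bytes"] // lane["l0_icache_instructions"]
lm_banks = lane["lm_total_bytes"] // lane["lm_bank_bytes"]
threads_per_active_slot = lane["threads"] // lane["active_threads"]

print(f"FLOPS/clock: {flops_per_clock}")
print(f"ORF entry width: {orf_entry_bytes} bytes")
print(f"Instruction size: {instruction_bytes} bytes")
print(f"LM banks: {lm_banks}")
print(f"Resident threads per active thread: {threads_per_active_slot}")
```

The results agree with the slide: 4 FLOPS/clock, 8-byte (64-bit) ORF entries, 16-byte instructions, and four LM banks, matching the Bank 0 to Bank 3 labels in the diagram.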
Energy Efficiency
Advanced Signaling 24GFLOPs/W (2x)
Efficient Microarchitecture 48GFLOPs/W (2x)
50GFLOPs/W (25x)
2GFLOPs/W
Energy Efficiency
Advanced Signaling 24GFLOPs/W (2x)
2GFLOPs/W
Parallel Programmability
Parallel programming is not inherently any more difficult than serial programming
However, we can make it a lot more difficult
Tools
Architecture
Tools
Map foralls in time and space
Map molecules across memories
Stage data up/down hierarchy
Select mechanisms
Architecture
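As one reading of "map foralls in time and space", a mapper decides which processing lane runs each iteration (space) and in what order each lane runs its share (time). The sketch below uses a simple round-robin assignment. The function name, the round-robin policy, and the lane count are assumptions for illustration and are not the mapper described in the talk.

```python
from collections import defaultdict

def map_forall(n_items: int, n_lanes: int) -> dict[int, list[int]]:
    """Assign iteration i of a forall to lane i % n_lanes.

    The key is the lane (space); the position of an iteration within
    a lane's list is the step at which it runs (time).
    """
    schedule = defaultdict(list)
    for i in range(n_items):
        schedule[i % n_lanes].append(i)
    return dict(schedule)

# Hypothetical example: 10 independent iterations on 4 lanes.
schedule = map_forall(n_items=10, n_lanes=4)
for lane, items in sorted(schedule.items()):
    print(f"lane {lane}: {items}")
```

A real mapper would also weigh data placement, since the same deck asks the tools to map data across memories and stage it through the hierarchy; this sketch ignores locality entirely.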
Incidental
Dealing with multiple address spaces
Partitioning data across nodes
Aggregating data to amortize message overhead
Parallel Programmability
Abstract Parallelism and Locality
10^7 Threads
Autotuning Mapper
System Sketch
[Figure: chip floorplan sketch. An array of SM tiles and a few LOC tiles connected by NOC routers, with L2 Banks, an XBAR, DRAM I/O, and NW I/O blocks; dimension label 17mm]
The fundamental problems are hard enough. We must eliminate the incidental ones.
2GFLOPs/W
Integrate Host 12GFLOPs/W (2x)
Efficient Microarchitecture 48GFLOPs/W (2x)
Fast Communication, Synchronization, Thread Mgt
Abstract Parallelism and Locality
Autotuning Mapper
10^9 Threads