
SC12 11/14/12

Titan: World's #1 Open Science Supercomputer

18,688 Tesla K20X GPUs (Kepler GK110)
27 petaflops peak; 90% of performance from the GPUs
17.59 petaflops sustained on HPL (Linpack)
2.12 GFLOPS/W for the system; the GK110 itself is 7 GFLOPS/W

The Road to Exascale

               2012 (you are here)    2020
Performance    20 PF                  1000 PF (50x)
Nodes          18,000 GPUs            72,000 HCNs (4x)
Power          10 MW                  20 MW (2x)
Efficiency     2 GFLOPS/W             50 GFLOPS/W (25x)
Threads        ~10^7                  ~10^9 (100x)

Technical Challenges on The Road to Exascale

1. Energy Efficiency
2. Parallel Programmability
3. Resilience

Energy Efficiency


Moore's Law to the Rescue?

Moore, Electronics 38(8), April 19, 1965

Unfortunately Not!

C. Moore, "Data Processing in ExaScale-Class Computer Systems," Salishan, April 2011

Chips are now power, not area limited

[Figure: P = 150 W chip vs. P = 5 W chip]

Perf (ops/s) = P (W) × Eff (ops/J)

Process is improving Eff by only 15-25% per node, which is about 2-3x in 8 years.
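As a rough check on that compounding claim (my arithmetic, assuming about four process nodes in those eight years, the same count the waterfall slides below use):

\[ 1.15^{4} \approx 1.7\times \qquad\qquad 1.25^{4} \approx 2.4\times \]

so per-node gains of 15-25% do land in the quoted 2-3x range.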

We need 25x energy efficiency.
2-3x will come from process.
10x must come from architecture and circuits.

Energy Efficiency

2 GFLOPS/W today
  + 4 process nodes (3x)    -> 6 GFLOPS/W
  + integrate host (2x)     -> 12 GFLOPS/W
  goal: 50 GFLOPS/W (25x)

The High Cost of Data Movement

Fetching operands costs more than computing on them (28 nm, 20 mm die):

64-bit DP operation              20 pJ
256-bit access to an 8 kB SRAM   50 pJ
256-bit on-chip bus              26 pJ for a short hop, 256 pJ farther, 1 nJ across the 20 mm chip
Efficient off-chip link          500 pJ
DRAM Rd/Wr                       16 nJ
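One hedged way to read these numbers (my arithmetic, scaling the 256-bit accesses down to a single 64-bit operand):

\[ \frac{16\,\mathrm{nJ}/4}{20\,\mathrm{pJ}} = 200\times \qquad\qquad \frac{1\,\mathrm{nJ}/4}{20\,\mathrm{pJ}} \approx 12\times \]

Fetching one operand from DRAM costs on the order of 200x the double-precision operation that consumes it, and even a full-chip on-chip hop costs more than ten operations. This is where the >100:1 global-to-local energy ratio cited later in the deck comes from.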

Low-Energy Signaling


Energy cost with efficient signaling (28 nm, 20 mm die)

64-bit DP operation              20 pJ
256-bit access to an 8 kB SRAM   50 pJ
256-bit on-chip bus              3 pJ for a short hop, 30 pJ farther, 100 pJ across the 20 mm chip
Efficient off-chip link          100 pJ
DRAM Rd/Wr                       1 nJ

Energy Efficiency

2 GFLOPS/W today
  + 4 process nodes (3x)       -> 6 GFLOPS/W
  + integrate host (2x)        -> 12 GFLOPS/W
  + advanced signaling (2x)    -> 24 GFLOPS/W
  goal: 50 GFLOPS/W (25x)

An Out-of-Order Core

Spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add): roughly 80x the energy of the multiply itself, and 4000x that of the add.

SM Lane Architecture

Control path: thread PCs, active PCs, scheduler, and an L0 instruction cache.
Data path: a register file (RF) and per-unit operand register files (ORFs) feeding two FP/Int units and an LS/BR unit, with L0/L1 address generation, a network port, and local memory (LM) banks 0-3 serving loads/stores.

64 threads, 4 active threads
2 DFMAs (4 FLOPS/clock)
ORF bank: 16 entries (128 bytes)
L0 I$: 64 instructions (1 KB)
LM bank: 8 KB (32 KB total)

Energy Efficiency

2 GFLOPS/W today
  + 4 process nodes (3x)              -> 6 GFLOPS/W
  + integrate host (2x)               -> 12 GFLOPS/W
  + advanced signaling (2x)           -> 24 GFLOPS/W
  + efficient microarchitecture (2x)  -> 48 GFLOPS/W
  goal: 50 GFLOPS/W (25x)

Energy Efficiency

2 GFLOPS/W today
  + 4 process nodes (3x)                                            -> 6 GFLOPS/W
  + integrate host (2x)                                             -> 12 GFLOPS/W
  + advanced signaling (2x)                                         -> 24 GFLOPS/W
  + efficient microarchitecture (2x)                                -> 48 GFLOPS/W
  + optimized voltages, efficient memory, locality enhancement (??x) -> 50 GFLOPS/W (25x)
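Multiplying out the labelled steps (my arithmetic; the slide gives only the per-step factors):

\[ 2 \times 3 \times 2 \times 2 \times 2 = 48\ \mathrm{GFLOPS/W} \]

which is within about 4% of the 50 GFLOPS/W (25x) goal, so the "??x" contribution from optimized voltages, efficient memory, and locality enhancement has to cover the last few percent.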

Parallel Programmability


Parallel programming is not inherently any more difficult than serial programming.
However, we can make it a lot more difficult.

A simple parallel program

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested forall
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

Why is this easy?

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested forall
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

No machine details
All parallelism is expressed
Synchronization is semantic (in the reduction)
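For concreteness, here is one way such a forall nest might be lowered onto a current GPU: a CUDA kernel with one thread per molecule, the inner foralls run sequentially inside each thread, and the reduction done as a private accumulation. This is only a hedged sketch of one possible mapping, not the mapping the talk prescribes; the Molecule layout, the fixed-degree neighbor list, forces_kernel, and pair_force() are assumptions for illustration, and the forall over force kinds is folded into pair_force() for brevity.

#include <cuda_runtime.h>

// Hypothetical data layout, purely for illustration.
struct Molecule {
    float3 pos;
    float3 force;
};

// Placeholder pair-wise interaction; a real code would evaluate the actual
// force kinds here (the "forall force in forces" level of the nest).
__device__ float3 pair_force(const Molecule& a, const Molecule& b) {
    return make_float3(b.pos.x - a.pos.x, b.pos.y - a.pos.y, b.pos.z - a.pos.z);
}

__global__ void forces_kernel(Molecule* set, const int* neighbors,
                              int num_molecules, int max_neighbors) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // "forall molecule in set"
    if (i >= num_molecules) return;

    float3 sum = make_float3(0.f, 0.f, 0.f);         // reduce_sum, accumulated privately
    for (int n = 0; n < max_neighbors; ++n) {        // "forall neighbor in molecule.neighbors"
        int j = neighbors[i * max_neighbors + n];
        float3 f = pair_force(set[i], set[j]);
        sum.x += f.x; sum.y += f.y; sum.z += f.z;
    }
    set[i].force = sum;
}

// Launch with one thread per molecule, e.g.:
// forces_kernel<<<(num_molecules + 255) / 256, 256>>>(d_set, d_neighbors,
//                                                     num_molecules, max_neighbors);

Note how much of this code is mapping rather than algorithm: the thread indexing, the neighbor-list layout, the launch geometry. That is exactly the work the following slides argue should move into tools.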

We could make it hard

pid = fork();                  // explicitly managing threads

lock(struct.lock);             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);

code = send(pid, tag, &msg);   // partition across nodes

Programmers, tools, and architecture need to play their positions

Programmer: algorithm, all of the parallelism, abstract locality
Tools: combinatorial optimization, mapping, selection of mechanisms
Architecture: fast mechanisms, exposed costs

Programmers, tools, and architecture need to play their positions

Programmer:
forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested forall
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools: map foralls in time and space, map molecules across memories, stage data up/down the hierarchy, select mechanisms (a sketch of such a mapping follows below)
Architecture: exposed storage hierarchy; fast comm/sync/thread mechanisms
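To make the tools row concrete, below is a hedged sketch of the kind of code a mapper might emit for the same forall nest once it has picked a placement: a thread block cooperatively stages a tile of molecules into shared memory (one exposed level of the storage hierarchy) and computes against it. The TILE size, the all-pairs simplification, forces_tiled, and pair_force() are illustrative assumptions, not the output of any actual tool.

#include <cuda_runtime.h>

// Illustrative tile size; a real mapper would choose this per machine.
// Launch with blockDim.x == TILE.
#define TILE 128

struct Molecule { float3 pos; float3 force; };

// Placeholder pair-wise interaction, as in the earlier sketch.
__device__ float3 pair_force(const Molecule& a, const Molecule& b) {
    return make_float3(b.pos.x - a.pos.x, b.pos.y - a.pos.y, b.pos.z - a.pos.z);
}

__global__ void forces_tiled(Molecule* set, int num_molecules) {
    __shared__ Molecule tile[TILE];                  // staged copy, one hierarchy level up

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's molecule
    Molecule mine = (i < num_molecules) ? set[i] : Molecule{};
    float3 sum = make_float3(0.f, 0.f, 0.f);

    // Sweep the set tile by tile (all-pairs for simplicity; a neighbor-list
    // version would stage only the listed neighbors).
    for (int base = 0; base < num_molecules; base += TILE) {
        int j = base + threadIdx.x;
        if (j < num_molecules) tile[threadIdx.x] = set[j];   // stage data up
        __syncthreads();

        int limit = min(TILE, num_molecules - base);
        for (int t = 0; t < limit; ++t) {            // compute against the staged tile
            float3 f = pair_force(mine, tile[t]);
            sum.x += f.x; sum.y += f.y; sum.z += f.z;
        }
        __syncthreads();
    }
    if (i < num_molecules) set[i].force = sum;       // write the result back down
}

The algorithm is unchanged from the programmer's forall nest; only the mapping (tiling, staging, launch shape) differs, which is the division of labor the slide argues for.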

Fundamental and Incidental Obstacles to Programmability

Fundamental:
Expressing 10^9-way parallelism
Expressing locality to deal with a >100:1 global:local energy ratio
Balancing load across 10^9 cores

Incidental:
Dealing with multiple address spaces
Partitioning data across nodes
Aggregating data to amortize message overhead

Parallel Programmability

From ~10^7 threads today to ~10^9 threads:
Abstract parallelism and locality
Autotuning mapper
Exposed storage hierarchy
Fast communication, synchronization, and thread management

NVIDIA Exascale Architecture


System Sketch


Echelon Chip Floorplan

10 nm process, 290 mm² die, roughly 17 mm on a side.
A tiled array of SMs (each built from multiple lanes) connected by a network-on-chip (NOC), together with a small number of latency-optimized cores (LOCs), L2 banks and a crossbar (XBAR), and DRAM I/O and network (NW) I/O blocks around the perimeter.

The fundamental problems are hard enough. We must eliminate the incidental ones.


Parallel Roads to Exascale

Energy efficiency road: 2 GFLOPS/W -> 4 process nodes (3x) -> integrate host (2x) -> advanced signaling (2x) -> efficient microarchitecture (2x) -> optimized voltages, efficient memory, locality enhancement (??x) -> 50 GFLOPS/W (25x)

Programmability road: abstract parallelism and locality -> autotuning mapper -> exposed storage hierarchy -> fast communication, synchronization, and thread management -> 10^9 threads
