
REAL TIME CONTROL FOR ADAPTIVE OPTICS WORKSHOP (3RD EDITION)
27th January 2016

François Courteille | Senior Solutions Architect, NVIDIA | fcourteille@nvidia.com

THE WORLD LEADER IN VISUAL COMPUTING

GAMING | PRO/ENTERPRISE VISUALIZATION | DATA CENTER | AUTO

TESLA ACCELERATED COMPUTING PLATFORM
Focused on Co-Design from Top to Bottom

Fast GPU: engineered for high throughput
Fast GPU + strong CPU
Productive programming model & tools
Expert co-design across the full stack: application, middleware, system software, large systems, processor
Accessibility

[Chart: peak TFLOPS of NVIDIA Tesla GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs, 2008-2014]

PERFORMANCE LEAD CONTINUES TO GROW

[Chart: peak double-precision GFLOPS, NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell), 2008-2014]

[Chart: peak memory bandwidth in GB/s, NVIDIA GPUs vs. x86 CPUs, 2008-2014]

GPU Architecture Roadmap

[Chart: SGEMM/W by architecture generation - Tesla, Fermi, Kepler, Maxwell, Pascal - 2008-2018]

Pascal: mixed precision, 3D memory, NVLink

Kepler SM (SMX)

192 CUDA cores (SP, DP, SFU, LD/ST units)
Four warp schedulers, not tied to cores
Double issue for max utilization
Shared memory / L1 cache, on-chip network

[Block diagram: SMX with instruction cache, warp schedulers, register file, and core groups]

Maxwell SM (SMM)

Simplified design: power-of-two, quadrant-based layout
Scheduler tied to cores; single issue is sufficient
Better utilization, lower instruction latency
32 SP CUDA cores per quadrant, each quadrant with its own instruction buffer, warp scheduler, and register file
Efficiency: <10% performance difference from SMX at ~50% of SMX chip area

[Block diagram: SMM with instruction cache, Tex/L1 caches, shared memory, and four quadrants]
Histogram: Performance per SM

[Chart: histogram bandwidth per SM (GiB/s) vs. elements per thread (16, 32, 64, 128) for Fermi M2070, Kepler K20X, and Maxwell GTX 750 Ti; Maxwell is up to 5.5x faster]

Higher performance expected with larger GPUs (more SMs)

TESLA GPU ACCELERATORS 2015-2016*

KEPLER K80: 2x GPU, 2.9 TF DP / 8.7 TF SP peak, 4.4 TF SGEMM / 1.59 TF DGEMM, 24 GB, ~480 GB/s, 300 W, PCIe passive
KEPLER K40: 1.43 TF DP / 4.3 TF SP peak, 3.3 TF SGEMM / 1.22 TF DGEMM, 12 GB, 288 GB/s, 235 W, PCIe active/passive
MAXWELL M40: 1x GPU, 7 TF SP peak (boost clock), 12 GB, 288 GB/s, 250 W, PCIe passive
MAXWELL M60 (GRID-enabled): 2x GPU, 7.4 TF SP peak, ~6 TF SGEMM, 16 GB, 320 GB/s, 300 W, PCIe active/passive
MAXWELL M4: 1x GPU, 2.2 TF SP peak, 4 GB, 88 GB/s, 50-75 W, PCIe low profile
MAXWELL M6 (GRID-enabled): 1x GPU, TBD TF SP peak, 8 GB, 160 GB/s, 75-100 W, MXM

Roadmap status: POR; Maxwell M6 in definition
*For end-customer deployments

TESLA PLATFORM PRODUCT STACK

HPC: Accelerated Computing Toolkit | Tesla K80
Enterprise Virtualization: GRID 2.0 | Tesla M60, M6
Hyperscale - DL Training: Hyperscale Suite | Tesla M40
Hyperscale - Web Services: Hyperscale Suite | Tesla M4
System Tools & Services (all segments): Enterprise Services, Data Center GPU Manager, Mesos, Docker

NVLINK: HIGH-SPEED GPU INTERCONNECT
Node Design Flexibility

2014: Kepler GPUs connected to x86, ARM64, or POWER CPUs over PCIe
2016: Pascal GPUs connected to each other over NVLink; NVLink to POWER CPUs, PCIe to x86 and ARM64 CPUs

[Diagram: Kepler + PCIe node (2014) vs. Pascal + NVLink node (2016)]

UNIFIED MEMORY: SIMPLER & FASTER WITH NVLINK

Traditional developer view: separate system memory and GPU memory
Developer view with Unified Memory: a single unified memory space
Developer view with Pascal & NVLink: unified memory over NVLink - share data structures at CPU memory speeds rather than PCIe speeds, and oversubscribe GPU memory

MOVE DATA WHERE IT IS NEEDED FAST
Accelerated Communication

GPUDirect P2P: multi-GPU scaling - fast GPU-to-GPU communication, eliminates CPU latency, ~2x application performance (see the sketch below)
GPUDirect RDMA: fast access to other nodes - eliminates the CPU bottleneck
NVLINK: up to 5x faster than PCIe - fast GPU memory access and fast access to system memory
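A minimal sketch of how GPUDirect P2P is typically enabled from CUDA code, assuming two peer-capable GPUs in the same node; the device IDs and transfer size are illustrative:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev0 = 0, dev1 = 1;            // illustrative device IDs
    size_t bytes = 1 << 20;            // 1 MiB transfer

    // Check whether the two GPUs can address each other's memory directly
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, dev0, dev1);
    cudaDeviceCanAccessPeer(&canAccess10, dev1, dev0);

    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(dev0);
    cudaMalloc(&buf0, bytes);
    if (canAccess01) cudaDeviceEnablePeerAccess(dev1, 0);   // dev0 may now map dev1 memory

    cudaSetDevice(dev1);
    cudaMalloc(&buf1, bytes);
    if (canAccess10) cudaDeviceEnablePeerAccess(dev0, 0);

    // Copy directly between the GPUs; with P2P enabled the transfer goes
    // over PCIe/NVLink without staging through host memory.
    cudaMemcpyPeer(buf1, dev1, buf0, dev0, bytes);
    cudaDeviceSynchronize();

    printf("P2P %s\n", (canAccess01 && canAccess10) ? "enabled" : "not available");
    cudaFree(buf1);
    cudaSetDevice(dev0);
    cudaFree(buf0);
    return 0;
}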

NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED

SUMMIT & SIERRA (U.S. Dept. of Energy): pre-exascale supercomputers for science
NOAA: new supercomputer for next-gen weather forecasting
IBM Watson: breakthrough natural language processing for cognitive computing

U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS
Powered by the Tesla Platform

100-300 PFLOPS peak
10x in scientific application performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink high-speed interconnect
40 TFLOPS per node, >3,400 nodes
2017

A major step forward on the path to exascale

ACCELERATORS SURGE IN WORLD'S TOP SUPERCOMPUTERS

100+ accelerated systems now on the Top500 list
1/3 of total FLOPS powered by accelerators
NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
Tesla supercomputers growing at 50% CAGR over the past five years

[Chart: number of accelerated systems in the Top500, 2013-2015]

TESLA PLATFORM FOR HPC


TESLA ACCELERATED COMPUTING PLATFORM

Data center infrastructure / system solutions:
GPU accelerators (GPU Boost), interconnect (GPUDirect, NVLink), system management (NVML), communication, infrastructure management

Development:
Programming languages & compiler solutions (LLVM), development tools - profile and debug (CUDA Debugging API), software solutions - libraries (cuBLAS)

"Accelerators Will Be Installed in More than Half of New Systems"
Source: Top 6 predictions for HPC in 2015

In 2014, NVIDIA enjoyed a dominant market share, with 85% of the accelerator market.

370 GPU-Accelerated
Applications

www.nvidia.com/appscatalog

70% OF TOP HPC APPS ACCELERATED
Intersect360 Survey of Top Apps

Top 10 HPC apps: 90% accelerated
Top 50 HPC apps: 70% accelerated
(Intersect360, Nov 2015: HPC Application Support for GPU Computing)

Top 25 apps in survey:
GROMACS, SIMULIA Abaqus, NAMD, AMBER, ANSYS Mechanical, Exelis IDL, MSC NASTRAN, LAMMPS, NWChem, LS-DYNA, Schrodinger, ANSYS Fluent, WRF, VASP, OpenFOAM, CHARMM, Quantum Espresso, ANSYS CFX, Star-CD, CCSM, COMSOL, Star-CCM+, BLAST, Gaussian, GAMESS

Legend (per app): all popular functions accelerated / some popular functions accelerated / in development / not supported

TESLA FOR HYPERSCALE
http://devblogs.nvidia.com/parallelforall/accelerating-hyperscale-datacenter-applications-tesla-gpus/

HYPERSCALE SUITE
Deep Learning Toolkit, GPU REST Engine, GPU-accelerated FFmpeg, Image Compute Engine, GPU support in Mesos

TESLA M40 - POWERFUL: fastest deep learning performance
TESLA M4 - LOW POWER: highest hyperscale throughput

TESLA PLATFORM FOR DEVELOPERS


TESLA FOR SIMULATION

LIBRARIES | DIRECTIVES | LANGUAGES
ACCELERATED COMPUTING TOOLKIT
TESLA ACCELERATED COMPUTING

DROP-IN ACCELERATION WITH GPU LIBRARIES

BLAS | LAPACK | SPARSE | FFT
Math | Deep Learning | Image Processing
cuBLAS, cuSPARSE, cuFFT, cuRAND, NPP, AmgX, ...

5x-10x speedups out of the box
Automatically scale with multi-GPU libraries (cuBLAS-XT, cuFFT-XT, AmgX, ...)
75% of developers use GPU libraries to accelerate their applications
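As an illustration of this drop-in style, a dense matrix multiply can be offloaded with a few cuBLAS calls. This is a minimal sketch, not taken from the slides; the matrix size and the use of managed memory are illustrative (build with nvcc -lcublas):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 1024;                          // illustrative square-matrix size
    const double alpha = 1.0, beta = 0.0;
    double *A, *B, *C;

    // Managed memory keeps the example short; cudaMalloc + explicit copies also work.
    cudaMallocManaged(&A, n * n * sizeof(double));
    cudaMallocManaged(&B, n * n * sizeof(double));
    cudaMallocManaged(&C, n * n * sizeof(double));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; }

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C, computed on the GPU (column-major, as in BLAS)
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);                 // expect 2.0 * n
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}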

DROP-IN ACCELERATION: NVBLAS

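NVBLAS intercepts standard Level-3 BLAS calls (GEMM and friends) and routes large enough ones to the GPU, so existing CPU code can be accelerated without recompilation. A hedged sketch of the workflow, assuming a host BLAS such as OpenBLAS is installed and the library paths are adjusted for the local system:

/* dgemm_drop_in.c - unmodified CPU BLAS code; no CUDA calls anywhere. */
#include <stdio.h>
#include <stdlib.h>

/* Standard Fortran BLAS symbol provided by the host BLAS library. */
extern void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
                   const int *k, const double *alpha, const double *A, const int *lda,
                   const double *B, const int *ldb, const double *beta,
                   double *C, const int *ldc);

int main(void) {
    int n = 2048;                       /* illustrative size */
    double alpha = 1.0, beta = 0.0;
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }

    dgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
    printf("C[0] = %f\n", C[0]);        /* expect n */

    free(A); free(B); free(C);
    return 0;
}

/*
 * Build against the host BLAS as usual, then preload NVBLAS to intercept
 * the Level-3 BLAS calls (paths are illustrative):
 *
 *   gcc dgemm_drop_in.c -o dgemm_drop_in -lopenblas
 *   NVBLAS_CONFIG_FILE=./nvblas.conf LD_PRELOAD=libnvblas.so ./dgemm_drop_in
 *
 * nvblas.conf (minimal):
 *   NVBLAS_CPU_BLAS_LIB  libopenblas.so   # fallback CPU BLAS for small calls
 *   NVBLAS_GPU_LIST      ALL              # use all visible GPUs
 */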

OpenACC: Simple | Powerful | Portable
Fueling the Next Wave of Scientific Discoveries in HPC

main()
{
  <serial code>
  #pragma acc kernels
  // automatically runs on GPU
  {
    <parallel code>
  }
}

University of Illinois, PowerGrid - MRI reconstruction: 70x speed-up, 2 days of effort
RIKEN Japan, NICAM - climate modeling: 7-8x speed-up, 5% of code modified
8000+ developers using OpenACC

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
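As a concrete instance of the pattern above, a SAXPY loop can be offloaded with a single OpenACC directive. This is an illustrative sketch (problem size and values are arbitrary), not code from the projects cited above:

#include <stdio.h>
#include <stdlib.h>

/* y = a*x + y, offloaded with OpenACC; the compiler generates the GPU kernel. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;                       /* illustrative problem size */
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 3.0f, x, y);                  /* build e.g. with: pgcc -acc saxpy.c */
    printf("y[0] = %f\n", y[0]);           /* expect 5.0 */

    free(x); free(y);
    return 0;
}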

MINIMAL EFFORT, BIG PERFORMANCE: LS-DALTON
Large-scale application for calculating high-accuracy molecular energies

Lines of code modified: <100
Weeks required: 1
Number of codes to maintain: 1 source

"OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation."
- Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University

[Chart: speedup vs. CPU for the LS-DALTON CCSD(T) module, benchmarked on the Titan supercomputer (AMD CPU vs. Tesla K20X), for Alanine-1 (13 atoms), Alanine-2 (23 atoms), and Alanine-3 (33 atoms)]

OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY
Paving the Path Forward: Single Code for All HPC Processors

[Chart: application speedup vs. a single CPU core for 359.miniGhost (Mantevo), NEMO (climate & ocean), and CloverLeaf (physics). The CPU-only MPI+OpenMP and MPI+OpenACC bars are nearly identical (4.1x vs. 4.3x, 5.2x vs. 5.3x, 7.1x vs. 7.1x), while the same OpenACC code running on CPU + GPU reaches 7.6x-30.3x.]

359.miniGhost - CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU
NEMO - CPU: Intel Xeon E5-2698 v3 per socket, 16 cores; GPU: NVIDIA K80, both GPUs
CloverLeaf - CPU: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs

INTRODUCING THE NEW OPENACC TOOLKIT
Free Toolkit Offers Simple & Powerful Path to Accelerated Computing

PGI Compiler: free OpenACC compiler for academia
NVProf Profiler: easily find where to add compiler directives
GPU Wizard: identify which GPU libraries can jumpstart your code
Code Samples: learn from examples of real-world algorithms
Documentation: quick start guide, best practices, forums

http://developer.nvidia.com/openacc

FREE OPENACC COURSES
Begin Accelerating Applications with OpenACC

DATE            COURSE                                           REGION
March 2016      Intro to Performance Portability with OpenACC    China
March 2016      Intro to Performance Portability with OpenACC    India
May 2016        Advanced OpenACC                                 Worldwide
September 2016  Intro to Performance Portability with OpenACC    Worldwide

Registration page: https://developer.nvidia.com/openacc-courses
Self-paced labs: http://nvidia.qwiklab.com

PROGRAMMING LANGUAGES

Numerical analytics: MATLAB, Mathematica, LabVIEW, Scilab, Octave
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++, KOKKOS, RAJA, HEMI, OCCA
Python: PyCUDA, Copperhead, Numba, NumbaPro
Java, C#: GPU.NET, Hybridizer (Altimesh), JCUDA, CUDA4J

COMPILE PYTHON FOR PARALLEL ARCHITECTURES
Anaconda Accelerate from Continuum Analytics

NumbaPro: array-oriented compiler for Python & NumPy
Compiles for CPUs or GPUs (uses LLVM + the NVIDIA Compiler SDK)
Fast development + fast execution: the ideal combination
Free academic license: http://continuum.io

MORE C++ PARALLEL FOR LOOPS
GPU Lambdas Enable Custom Parallel Programming Models

Kokkos (https://github.com/kokkos):

Kokkos::parallel_for(N, KOKKOS_LAMBDA (int i) {
    y[i] = a * x[i] + y[i];
});

RAJA (https://e-reports-ext.llnl.gov/pdf/782261.pdf):

RAJA::forall<cuda_exec>(0, N, [=] __device__ (int i) {
    y[i] = a * x[i] + y[i];
});

Hemi, a CUDA portability library (http://github.com/harrism/hemi):

hemi::parallel_for(0, N, [=] HEMI_LAMBDA (int i) {
    y[i] = a * x[i] + y[i];
});
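All three wrappers ultimately launch a CUDA kernel with a __device__ lambda. A minimal, self-contained sketch of that mechanism (not any of the libraries above; build with nvcc --expt-extended-lambda):

#include <cuda_runtime.h>
#include <cstdio>

// Generic kernel: each thread applies the user-supplied functor to one index.
template <typename F>
__global__ void forall_kernel(int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

// Tiny parallel_for wrapper in the spirit of Kokkos/RAJA/Hemi.
template <typename F>
void parallel_for(int n, F f) {
    int block = 256;
    int grid = (n + block - 1) / block;
    forall_kernel<<<grid, block>>>(n, f);
}

int main() {
    int n = 1 << 20;
    float a = 2.0f, *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 1.0f; }

    // __device__ lambda captured by value and executed on the GPU
    parallel_for(n, [=] __device__ (int i) { y[i] = a * x[i] + y[i]; });
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);   // expect 3.0
    cudaFree(x); cudaFree(y);
    return 0;
}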

THRUST LIBRARY
Programming with algorithms and policies today

Bundled with NVIDIA's CUDA Toolkit
Supports execution on GPUs and CPUs
Ongoing performance & feature improvements
Functionality beyond Parallel STL

[Chart: Thrust sort speedup, CUDA 7.0 vs. 6.5, 32M samples - 1.1x to 1.8x depending on key type (char, short, int, long, float, double)]

From the CUDA 7.0 Performance Report. Run on K40m, ECC ON, input and output data on device.
Performance may vary based on OS, software versions, and motherboard configuration.
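For reference, the Thrust style referred to here looks like the following minimal sketch (key type and element count are illustrative):

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // Fill a host vector with random keys, then move it to the GPU.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();
    thrust::device_vector<int> d = h;

    // Sort on the device; the same code targets CPU backends (OpenMP/TBB)
    // by switching Thrust's device system at compile time.
    thrust::sort(d.begin(), d.end());

    h = d;   // copy the sorted keys back
    printf("first=%d last=%d\n", h[0], h[h.size() - 1]);
    return 0;
}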

Portable, High-level Parallel Code TODAY

The Thrust library allows the same C++ code to target both:
NVIDIA GPUs
x86, ARM and POWER CPUs

Thrust was the inspiration for a proposal to the ISO C++ Committee;
the committee voted unanimously to accept it as an official technical specification working draft.

N3960 Technical Specification Working Draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype: https://github.com/n3554/n3554

STANDARDIZING PARALLEL STL
Technical Specification for C++ Extensions for Parallelism

Published as ISO/IEC TS 19570:2015, July 2015
Draft available online: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf
We've proposed adding this to C++17: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0024r0.html
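For reference, this is roughly the interface shape that later landed in C++17; a hedged sketch, not taken from the slides, assuming a standard library with <execution> support:

#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> v(1 << 20);
    std::iota(v.begin(), v.end(), 0.0);

    // Standard algorithms take an execution policy as their first argument;
    // std::execution::par asks the implementation to parallelize.
    std::sort(std::execution::par, v.begin(), v.end(), std::greater<>());
    double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);

    printf("v[0]=%f sum=%f\n", v[0], sum);
    return 0;
}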

CUDA
Super Simplified Memory Management Code

CPU Code:

void sortfile(FILE *fp, int N) {
  char *data;
  data = (char *)malloc(N);

  fread(data, 1, N, fp);

  qsort(data, N, 1, compare);

  use_data(data);

  free(data);
}

CUDA 6 Code with Unified Memory:

void sortfile(FILE *fp, int N) {
  char *data;
  cudaMallocManaged(&data, N);

  fread(data, 1, N, fp);

  qsort<<<...>>>(data, N, 1, compare);
  cudaDeviceSynchronize();

  use_data(data);

  cudaFree(data);
}

INTRODUCING NCCL ("NICKEL"):
ACCELERATED COLLECTIVES FOR MULTI-GPU SYSTEMS

INTRODUCING NCCL
Accelerating multi-GPU collective communications

GOAL:
Build a research library of accelerated collectives that is easily integrated and
topology-aware, so as to improve the scalability of multi-GPU applications

APPROACH:
Pattern the library after MPI's collectives
Handle the intra-node communication in an optimal way
Provide the necessary functionality for MPI to build on top of, to handle inter-node communication

NCCL FEATURES AND FUTURES
(Green = currently available)

Collectives:
Broadcast, All-Gather, Reduce, All-Reduce, Reduce-Scatter, Scatter, Gather, All-To-All, Neighborhood

Key features:
Single-node, up to 8 GPUs
Host-side API
Asynchronous/non-blocking interface
Multi-thread, multi-process support
In-place and out-of-place operation
Integration with MPI
Topology detection
NVLink & PCIe/QPI* support

NCCL IMPLEMENTATION

Implemented as monolithic CUDA C++ kernels combining the following:
GPUDirect P2P direct access
Three primitive operations: Copy, Reduce, ReduceAndCopy
Intra-kernel synchronization between GPUs
One CUDA thread block per ring direction

NCCL EXAMPLE
All-reduce

#include <nccl.h>

ncclComm_t comm[4];
int devs[4] = {0, 1, 2, 3};
ncclCommInitAll(comm, 4, devs);

foreach g in (GPUs) { // or foreach thread
  cudaSetDevice(g);
  double *d_send, *d_recv;
  // allocate d_send, d_recv; fill d_send with data
  ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
  // consume d_recv
}

NCCL PERFORMANCE

[Charts: bandwidth at different problem sizes on 4 Maxwell GPUs for Broadcast, All-Reduce, All-Gather, and Reduce-Scatter]

AVAILABLE NOW
github.com/NVIDIA/nccl

COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS

Libraries (AmgX, cuBLAS, ...), compiler directives, and programming languages share common programming models across x86 and other CPU architectures as well as GPUs

GPU DEVELOPER ECO-SYSTEM

Numerical packages: MATLAB, Mathematica, NI LabVIEW, PyCUDA
Debuggers & profilers: cuda-gdb, NVIDIA Visual Profiler, Parallel Nsight, Visual Studio, Allinea, TotalView
GPU compilers: C, C++, Fortran, Java, Python
Auto-parallelizing & cluster tools: OpenACC, mCUDA, OpenMP, Ocelot
Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib
Consultants & training: ANEO, GPU Tech
OEM solution providers

DEVELOP ON GEFORCE, DEPLOY ON TESLA

GeForce - designed for developers & gamers: available everywhere
https://developer.nvidia.com/cuda-gpus

Tesla - designed for the data center: ECC, 24x7 runtime, GPU monitoring, cluster management, GPUDirect-RDMA, Hyper-Q for MPI, 3-year warranty, integrated OEM systems, professional support

RESOURCES
Learn more about GPUs

CUDA resource center: http://docs.nvidia.com/cuda
GTC on-demand and webinars: http://on-demand-gtc.gputechconf.com | http://www.gputechconf.com/gtc-webinars
Parallel Forall blog: http://devblogs.nvidia.com/parallelforall
Self-paced labs: http://nvidia.qwiklab.com

TEGRA TX1


JETSON TX1 - KEY SPECS
Supercomputer on a module

GPU:        1 TFLOP/s 256-core Maxwell
CPU:        64-bit ARM A57 CPUs
Memory:     4 GB LPDDR4 | 25.6 GB/s
Storage:    16 GB eMMC
Wifi/BT:    802.11 2x2 ac / BT ready
Networking: 1 Gigabit Ethernet
Size:       50 mm x 87 mm
Interface:  400-pin board-to-board connector
Power:      Under 10 W for typical use cases

JETSON LINUX SDK

Graphics
Deep learning and computer vision
GPU compute
NVTX (NVIDIA Tools eXtension)
Debugger | Profiler | System Trace
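As an illustration of the NVTX piece, application code can push named ranges that then appear on the profiler / system-trace timeline. A minimal hedged sketch (the stage names and timings are invented; link with -lnvToolsExt):

#include <nvToolsExt.h>
#include <unistd.h>

/* Simulated stages of a vision pipeline, annotated for the profiler timeline. */
static void capture_frame()  { usleep(2000); }
static void run_inference()  { usleep(8000); }

int main(void) {
    for (int frame = 0; frame < 10; ++frame) {
        nvtxRangePushA("capture");     /* named range appears in the trace */
        capture_frame();
        nvtxRangePop();

        nvtxRangePushA("inference");
        run_inference();
        nvtxRangePop();
    }
    return 0;
}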

10X ENERGY EFFICIENCY FOR MACHINE LEARNING

[Chart: AlexNet efficiency in images/sec/Watt - Intel Core i7-6700K (Skylake) vs. Jetson TX1]

PATH TO AN AUTONOMOUS DRONE

                      TODAY'S DRONE (GPS-BASED)   CORE i7      JETSON TX1
Performance*          1x                          100x         100x
Power (compute)       2 W                         60 W         6 W
Power (mechanical)    70 W                        100 W        80 W
Flight time           20 minutes                  9 minutes    18 minutes

*Based on SGEMM performance

Comprehensive developer platform
http://developer.nvidia.com/embedded-computing

Jetson TX1 Developer Kit
$599 retail | $299 EDU
Pre-order Nov 12, shipping Nov 16 (US), international to follow

Jetson TX1 Module
$299 (1000-unit quantity), available 1Q16
Distributors worldwide

ONE ARCHITECTURE - END-TO-END AI

Tesla for cloud | Titan X for PC | DRIVE PX for auto | Jetson for embedded

FIVE THINGS TO REMEMBER

Time of accelerators has come
NVIDIA is focused on co-design from top to bottom
Accelerators are surging in supercomputing
Machine learning is the next killer application for HPC
Tesla platform leads in every way
