NVIDIA MARKET SEGMENTS
GAMING | ENTERPRISE | PRO VISUALIZATION | DATA CENTER | AUTO
ACCELERATED COMPUTING: FAST GPU + STRONG CPU
[Chart: peak TFLOPS, 2008-2014: NVIDIA Tesla GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPU]
Co-design across the whole stack: APPLICATION, MIDDLEWARE, SYS SW, PROCESSOR, LARGE SYSTEMS
- Productive programming model & tools
- Expert co-design
- Accessibility
[Charts: peak GFLOPS and memory bandwidth (GB/s), 2008-2014: NVIDIA Tesla GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
GPU ROADMAP
[Chart: SGEMM / W, 2008-2018: Tesla, Fermi, Kepler, Maxwell, Pascal]
Pascal: mixed precision, 3D memory, NVLink
KEPLER SM (SMX)
- 192 CUDA cores per SM
- Instruction cache, four warp schedulers, large register file
- SP, DP, SFU, and LD/ST units
MAXWELL SM (SMM)
- Simplified, power-of-two, quadrant-based design: four quadrants of 32 SP CUDA cores each
- Per-quadrant instruction buffer, warp scheduler, and register file; shared Tex/L1 cache and shared memory
- Better utilization: <10% difference from SMX at ~50% of the SMX chip area
[Chart: bandwidth per SM (GiB/s): up to 5.5x faster than Kepler K20X]
TESLA PRODUCTS
- KEPLER K80: 2x GPU; 2.9 TF DP / 8.7 TF SP peak; 4.4 TF SGEMM / 1.59 TF DGEMM; 24 GB; ~480 GB/s; 300 W; PCIe passive
- KEPLER K40
- MAXWELL M60: 2x GPU; 7.4 TF SP peak; ~6 TF SGEMM; 16 GB; 320 GB/s; 300 W; PCIe active/PCIe passive; GRID enabled
- MAXWELL M40
- MAXWELL M4 / M6: 1x GPU; TBD TF SP peak; 8 GB; 160 GB/s; 75-100 W; MXM; POR / in definition; GRID enabled
TESLA PLATFORM
- HPC: Accelerated Computing Toolkit; Tesla K80
- Enterprise virtualization: GRID 2.0; Tesla M60, M6
- Hyperscale (DL training, web services): Hyperscale Suite; Tesla M40, M4
NVLINK: HIGH-SPEED GPU INTERCONNECT
Node design flexibility:
- 2014: Kepler GPU over PCIe to x86, ARM64, or POWER CPU
- 2016: Pascal GPU with NVLink between GPUs and to POWER CPU; PCIe to x86, ARM64
NVLINK AND UNIFIED MEMORY
- Share data structures between system memory and GPU memory at CPU memory speeds, not PCIe speeds
- Oversubscribe GPU memory
NVLINK: MULTI-GPU SCALING, 2x APP PERFORMANCE
Pre-exascale supercomputers for science: SUMMIT, SIERRA; also NOAA, IBM Watson
ACCELERATORS SURGE IN WORLD'S TOP SUPERCOMPUTERS
[Chart: Top500 number of accelerated supercomputers, 2013-2015]
TESLA PLATFORM ECOSYSTEM
- Communication: interconnect (GPU Direct, NVLink)
- Infrastructure management: system management (NVML)
- Development: programming languages / compiler solutions (LLVM); development tools / profile and debug; software solutions / libraries (cuBLAS)
- GPU accelerators: GPU Boost

370 GPU-accelerated applications: www.nvidia.com/appscatalog
TOP HPC APPLICATIONS ACCELERATED
- 90% of the top 10 applications accelerated
- 70% of the top 50 applications accelerated
GROMACS, SIMULIA Abaqus, NAMD, AMBER, ANSYS Mechanical, Exelis IDL, MSC NASTRAN, LAMMPS, NWChem, LS-DYNA, Schrodinger, ANSYS Fluent, WRF, VASP, OpenFOAM, CHARMM, Quantum Espresso, ANSYS CFX, Star-CD, CCSM, COMSOL, Star-CCM+, BLAST, Gaussian, GAMESS
HYPERSCALE SUITE
- Deep Learning Toolkit
- GPU-accelerated FFmpeg
- Image Compute Engine
- GPU support in Mesos

TESLA M40: powerful: fastest deep learning performance
TESLA M4: low power: highest hyperscale throughput
ACCELERATION APPROACHES: DIRECTIVES | LANGUAGES
GPU-ACCELERATED MATH LIBRARIES: AmgX, cuRAND, cuFFT, cuBLAS, NPP, cuSPARSE
OPENACC: SIMPLE | POWERFUL | PORTABLE

    main()
    {
      <serial code>
      #pragma acc kernels
      { // automatically runs on GPU
        <parallel code>
      }
    }

- University of Illinois: 70x speed-up with 2 days of effort
- RIKEN Japan
- 8000+ developers using OpenACC

References:
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
OPENACC: MINIMAL EFFORT, BIG PERFORMANCE
LS-DALTON: large-scale application for calculating high-accuracy molecular energies
- Lines of code modified: <100
- Weeks required: 1
- Sources to maintain: 1
[Chart: LS-DALTON CCSD(T) module speedup vs. CPU (0-12x scale), benchmarked on the Titan supercomputer (AMD CPU vs. Tesla K20X): Alanine-1 (13 atoms), Alanine-2 (23 atoms), Alanine-3 (33 atoms)]
[Chart: OpenACC speedups vs. CPU, 4.1x to 30.3x, including 359.miniGhost (Mantevo) and CLOVERLEAF (physics)]
Test configurations:
- 359.miniGhost: CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU
- NEMO: CPU: Intel Xeon E5-2698 v3, 16 cores per socket; GPU: Tesla K80, both GPUs
- CLOVERLEAF: CPU: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs
OPENACC RESOURCES: http://developer.nvidia.com/openacc
- NVProf profiler, GPU Wizard, code samples, documentation
Courses: China (March 2016), India (March 2016), Advanced OpenACC worldwide (May 2016), worldwide (September 2016)
GPU PROGRAMMING LANGUAGES: numerical analytics, Fortran, C, C++, Python, Java, C#
Free academic license: http://continuum.io
PORTABILITY LIBRARIES
- RAJA: https://e-reports-ext.llnl.gov/pdf/782261.pdf
- Hemi, a CUDA portability library: http://github.com/harrism/hemi
THRUST LIBRARY
Programming with algorithms and policies today
[Chart: Thrust sort speedup, CUDA 7.0 vs. 6.5 (32M samples): 1.1x-1.8x across char, short, int, long, float, and double]

STANDARDIZING PARALLEL STL
Prototype: https://github.com/n3554/n3554
CUDA UNIFIED MEMORY: SUPER SIMPLIFIED MEMORY MANAGEMENT CODE

CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

CUDA with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);
        fread(data, 1, N, fp);
        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();
        use_data(data);
        cudaFree(data);
    }
INTRODUCING NCCL
Accelerating multi-GPU collective communications
GOAL: build a research library of accelerated collectives that is easily integrated and topology-aware, so as to improve the scalability of multi-GPU applications
APPROACH: pattern the library after MPI's collectives
Collectives: Broadcast, All-Gather, Reduce, All-Reduce, Reduce-Scatter, Scatter, Gather, All-To-All, Neighborhood

Key features:
- Single-node, up to 8 GPUs
- Host-side API
- Asynchronous/non-blocking interface
- Topology detection
NCCL IMPLEMENTATION
Implemented as monolithic CUDA C++ kernels combining:
- GPUDirect P2P direct access
- Three primitive operations: Copy, Reduce, ReduceAndCopy
NCCL EXAMPLE
All-reduce:

    #include <nccl.h>
    ncclComm_t comm[4];
    int devs[4] = {0, 1, 2, 3};
    ncclCommInitAll(comm, 4, devs);
    foreach g in (GPUs) { // or foreach thread
      cudaSetDevice(g);
      ncclAllReduce(sendbuff[g], recvbuff[g], N, ncclFloat, ncclSum,
                    comm[g], stream[g]);
    }
NCCL PERFORMANCE
[Chart: bandwidth at different problem sizes on 4 Maxwell GPUs: Broadcast, All-Reduce, All-Gather, Reduce-Scatter]

AVAILABLE NOW: github.com/NVIDIA/nccl
Recap: libraries (AmgX, cuBLAS) | compiler directives | programming languages | x86
GPU DEVELOPER ECOSYSTEM
- Debuggers & profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView
- GPU compilers: C, C++, Fortran, Java, Python
- Auto-parallelizing & cluster tools: OpenACC, mCUDA, OpenMP, Ocelot
- Libraries: BLAS, FFT, LAPACK, NPP, video, imaging, GPULib
- Consultants: ANEO, GPU Tech
RESOURCES
Learn more about GPUs:
- CUDA resource center: http://docs.nvidia.com/cuda
- GTC webinars: http://www.gputechconf.com/gtc-webinars
- Self-paced labs: http://nvidia.qwiklab.com
JETSON TX1: SUPERCOMPUTER ON A MODULE
Key specs:
- Storage: 16 GB eMMC
- Wifi/BT
- Networking: 1 Gigabit Ethernet
- Size: 50mm x 87mm
- Power: under 10 W for typical use cases
[Chart: GPU compute performance, Jetson TX1 vs. Intel Core i7-6700K (Skylake)]
NVTX: NVIDIA Tools eXtension
PATH TO AN AUTONOMOUS DRONE

                      Today's drone (GPS-based) | Core i7     | Jetson TX1
Performance*          1x                        | 100x        | 100x
Power (compute)       2 W                       | 60 W        | 6 W
Power (mechanical)    70 W                      | 100 W       | 80 W
Flight time           20 minutes                | 9 minutes   | 18 minutes
Comprehensive developer platform: http://developer.nvidia.com/embedded-computing
JETSON TX1 DEVELOPER KIT
- $599 retail; $299 EDU
- Pre-order Nov 12
- Shipping Nov 16 (US); international to follow
- Tesla for Cloud
- Titan X for PC
- DRIVE PX for Auto
- Jetson for Embedded

FIVE THINGS TO REMEMBER