
REAL TIME CONTROL FOR ADAPTIVE OPTICS WORKSHOP (3RD EDITION)
27th January 2016

François Courteille | Senior Solutions Architect, NVIDIA | fcourteille@nvidia.com

THE WORLD LEADER IN VISUAL COMPUTING

GAMING | PRO/ENTERPRISE VISUALIZATION | DATA CENTER | AUTO

TESLA ACCELERATED COMPUTING PLATFORM
Focused on Co-Design from Top to Bottom

Fast GPU: engineered for high throughput
Fast GPU + strong CPU
Productive programming model & tools
Expert co-design across the full stack: application, middleware, system software, large systems, processor
Accessibility

[Chart: peak TFLOPS of NVIDIA Tesla GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs, 2008-2014]

PERFORMANCE LEAD CONTINUES TO GROW

[Chart: peak double-precision GFLOPS, NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell), 2008-2014]

[Chart: peak memory bandwidth in GB/s, NVIDIA GPUs vs. x86 CPUs, 2008-2014]

GPU Architecture Roadmap

[Chart: SGEMM/W by architecture generation - Tesla, Fermi, Kepler, Maxwell, Pascal - 2008-2018]

Pascal: mixed precision, 3D memory, NVLink

Kepler SM (SMX)

192 CUDA cores (SP, DP, SFU, LD/ST units)
Four warp schedulers, not tied to cores
Double issue for max utilization
Shared memory / L1 cache, on-chip network

[Block diagram: SMX with instruction cache, warp schedulers, register file, and core groups]

Maxwell SM (SMM)

Simplified design: power-of-two, quadrant-based layout
Scheduler tied to cores; single issue is sufficient
Better utilization, lower instruction latency
32 SP CUDA cores per quadrant, each quadrant with its own instruction buffer, warp scheduler, and register file
Efficiency: <10% performance difference from SMX at ~50% of SMX chip area

[Block diagram: SMM with instruction cache, Tex/L1 caches, shared memory, and four quadrants]
Histogram: Performance per SM

[Chart: histogram bandwidth per SM (GiB/s) vs. elements per thread (16, 32, 64, 128) for Fermi M2070, Kepler K20X, and Maxwell GTX 750 Ti; Maxwell is up to 5.5x faster]

Higher performance expected with larger GPUs (more SMs)

TESLA GPU ACCELERATORS 2015-2016*

KEPLER K80: 2x GPU, 2.9 TF DP / 8.7 TF SP peak, 4.4 TF SGEMM / 1.59 TF DGEMM, 24 GB, ~480 GB/s, 300 W, PCIe passive
KEPLER K40: 1.43 TF DP / 4.3 TF SP peak, 3.3 TF SGEMM / 1.22 TF DGEMM, 12 GB, 288 GB/s, 235 W, PCIe active/passive
MAXWELL M40: 1x GPU, 7 TF SP peak (boost clock), 12 GB, 288 GB/s, 250 W, PCIe passive
MAXWELL M60 (GRID-enabled): 2x GPU, 7.4 TF SP peak, ~6 TF SGEMM, 16 GB, 320 GB/s, 300 W, PCIe active/passive
MAXWELL M4: 1x GPU, 2.2 TF SP peak, 4 GB, 88 GB/s, 50-75 W, PCIe low profile
MAXWELL M6 (GRID-enabled): 1x GPU, TBD TF SP peak, 8 GB, 160 GB/s, 75-100 W, MXM

Roadmap status: POR; Maxwell M6 in definition
*For end-customer deployments

TESLA PLATFORM PRODUCT STACK

HPC: Accelerated Computing Toolkit | Tesla K80
Enterprise Virtualization: GRID 2.0 | Tesla M60, M6
Hyperscale - DL Training: Hyperscale Suite | Tesla M40
Hyperscale - Web Services: Hyperscale Suite | Tesla M4
System Tools & Services (all segments): Enterprise Services, Data Center GPU Manager, Mesos, Docker

NVLINK: HIGH-SPEED GPU INTERCONNECT
Node Design Flexibility

2014: Kepler GPUs connected to x86, ARM64, or POWER CPUs over PCIe
2016: Pascal GPUs connected to each other over NVLink; NVLink to POWER CPUs, PCIe to x86 and ARM64 CPUs

[Diagram: Kepler + PCIe node (2014) vs. Pascal + NVLink node (2016)]

UNIFIED MEMORY: SIMPLER & FASTER WITH NVLINK

Traditional developer view: separate system memory and GPU memory
Developer view with Unified Memory: a single unified memory space
Developer view with Pascal & NVLink: unified memory over NVLink - share data structures at CPU memory speeds rather than PCIe speeds, and oversubscribe GPU memory

MOVE DATA WHERE IT IS NEEDED FAST
Accelerated Communication

GPUDirect P2P: multi-GPU scaling - fast GPU-to-GPU communication, eliminates CPU latency, ~2x application performance (see the sketch below)
GPUDirect RDMA: fast access to other nodes - eliminates the CPU bottleneck
NVLINK: up to 5x faster than PCIe - fast GPU memory access and fast access to system memory
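A minimal sketch of how GPUDirect P2P is typically enabled from CUDA code, assuming two peer-capable GPUs in the same node; the device IDs and transfer size are illustrative:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev0 = 0, dev1 = 1;            // illustrative device IDs
    size_t bytes = 1 << 20;            // 1 MiB transfer

    // Check whether the two GPUs can address each other's memory directly
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, dev0, dev1);
    cudaDeviceCanAccessPeer(&canAccess10, dev1, dev0);

    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(dev0);
    cudaMalloc(&buf0, bytes);
    if (canAccess01) cudaDeviceEnablePeerAccess(dev1, 0);   // dev0 may now map dev1 memory

    cudaSetDevice(dev1);
    cudaMalloc(&buf1, bytes);
    if (canAccess10) cudaDeviceEnablePeerAccess(dev0, 0);

    // Copy directly between the GPUs; with P2P enabled the transfer goes
    // over PCIe/NVLink without staging through host memory.
    cudaMemcpyPeer(buf1, dev1, buf0, dev0, bytes);
    cudaDeviceSynchronize();

    printf("P2P %s\n", (canAccess01 && canAccess10) ? "enabled" : "not available");
    cudaFree(buf1);
    cudaSetDevice(dev0);
    cudaFree(buf0);
    return 0;
}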

NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED

SUMMIT & SIERRA (U.S. Dept. of Energy): pre-exascale supercomputers for science
NOAA: new supercomputer for next-gen weather forecasting
IBM Watson: breakthrough natural language processing for cognitive computing

U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS
Powered by the Tesla Platform

100-300 PFLOPS peak
10x in scientific application performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink high-speed interconnect
40 TFLOPS per node, >3,400 nodes
2017

A major step forward on the path to exascale

ACCELERATORS SURGE IN WORLD'S TOP SUPERCOMPUTERS

100+ accelerated systems now on the Top500 list
1/3 of total FLOPS powered by accelerators
NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
Tesla supercomputers growing at 50% CAGR over the past five years

[Chart: number of accelerated systems in the Top500, 2013-2015]

TESLA PLATFORM FOR HPC


TESLA ACCELERATED COMPUTING PLATFORM

Data center infrastructure / system solutions:
GPU accelerators (GPU Boost), interconnect (GPUDirect, NVLink), system management (NVML), communication, infrastructure management

Development:
Programming languages & compiler solutions (LLVM), development tools - profile and debug (CUDA Debugging API), software solutions - libraries (cuBLAS)

"Accelerators Will Be Installed in More than Half of New Systems"
Source: Top 6 predictions for HPC in 2015

In 2014, NVIDIA enjoyed a dominant market share, with 85% of the accelerator market.

370 GPU-Accelerated
Applications

www.nvidia.com/appscatalog

70% OF TOP HPC APPS ACCELERATED
Intersect360 Survey of Top Apps

Top 10 HPC apps: 90% accelerated
Top 50 HPC apps: 70% accelerated
(Intersect360, Nov 2015: HPC Application Support for GPU Computing)

Top 25 apps in survey:
GROMACS, SIMULIA Abaqus, NAMD, AMBER, ANSYS Mechanical, Exelis IDL, MSC NASTRAN, LAMMPS, NWChem, LS-DYNA, Schrodinger, ANSYS Fluent, WRF, VASP, OpenFOAM, CHARMM, Quantum Espresso, ANSYS CFX, Star-CD, CCSM, COMSOL, Star-CCM+, BLAST, Gaussian, GAMESS

Legend (per app): all popular functions accelerated / some popular functions accelerated / in development / not supported

TESLA FOR HYPERSCALE
http://devblogs.nvidia.com/parallelforall/accelerating-hyperscale-datacenter-applications-tesla-gpus/

HYPERSCALE SUITE
Deep Learning Toolkit, GPU REST Engine, GPU-accelerated FFmpeg, Image Compute Engine, GPU support in Mesos

TESLA M40 - POWERFUL: fastest deep learning performance
TESLA M4 - LOW POWER: highest hyperscale throughput

TESLA PLATFORM FOR DEVELOPERS


TESLA FOR SIMULATION

LIBRARIES | DIRECTIVES | LANGUAGES
ACCELERATED COMPUTING TOOLKIT
TESLA ACCELERATED COMPUTING

DROP-IN ACCELERATION WITH GPU LIBRARIES

BLAS | LAPACK | SPARSE | FFT
Math | Deep Learning | Image Processing
cuBLAS, cuSPARSE, cuFFT, cuRAND, NPP, AmgX, ...

5x-10x speedups out of the box
Automatically scale with multi-GPU libraries (cuBLAS-XT, cuFFT-XT, AmgX, ...)
75% of developers use GPU libraries to accelerate their applications
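As an illustration of this drop-in style, a dense matrix multiply can be offloaded with a few cuBLAS calls. This is a minimal sketch, not taken from the slides; the matrix size and the use of managed memory are illustrative (build with nvcc -lcublas):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 1024;                          // illustrative square-matrix size
    const double alpha = 1.0, beta = 0.0;
    double *A, *B, *C;

    // Managed memory keeps the example short; cudaMalloc + explicit copies also work.
    cudaMallocManaged(&A, n * n * sizeof(double));
    cudaMallocManaged(&B, n * n * sizeof(double));
    cudaMallocManaged(&C, n * n * sizeof(double));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; }

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C, computed on the GPU (column-major, as in BLAS)
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);                 // expect 2.0 * n
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}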

DROP-IN ACCELERATION: NVBLAS

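NVBLAS intercepts standard Level-3 BLAS calls (GEMM and friends) and routes large enough ones to the GPU, so existing CPU code can be accelerated without recompilation. A hedged sketch of the workflow, assuming a host BLAS such as OpenBLAS is installed and the library paths are adjusted for the local system:

/* dgemm_drop_in.c - unmodified CPU BLAS code; no CUDA calls anywhere. */
#include <stdio.h>
#include <stdlib.h>

/* Standard Fortran BLAS symbol provided by the host BLAS library. */
extern void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
                   const int *k, const double *alpha, const double *A, const int *lda,
                   const double *B, const int *ldb, const double *beta,
                   double *C, const int *ldc);

int main(void) {
    int n = 2048;                       /* illustrative size */
    double alpha = 1.0, beta = 0.0;
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }

    dgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
    printf("C[0] = %f\n", C[0]);        /* expect n */

    free(A); free(B); free(C);
    return 0;
}

/*
 * Build against the host BLAS as usual, then preload NVBLAS to intercept
 * the Level-3 BLAS calls (paths are illustrative):
 *
 *   gcc dgemm_drop_in.c -o dgemm_drop_in -lopenblas
 *   NVBLAS_CONFIG_FILE=./nvblas.conf LD_PRELOAD=libnvblas.so ./dgemm_drop_in
 *
 * nvblas.conf (minimal):
 *   NVBLAS_CPU_BLAS_LIB  libopenblas.so   # fallback CPU BLAS for small calls
 *   NVBLAS_GPU_LIST      ALL              # use all visible GPUs
 */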

OpenACC: Simple | Powerful | Portable
Fueling the Next Wave of Scientific Discoveries in HPC

main()
{
  <serial code>
  #pragma acc kernels
  // automatically runs on GPU
  {
    <parallel code>
  }
}

University of Illinois, PowerGrid - MRI reconstruction: 70x speed-up, 2 days of effort
RIKEN Japan, NICAM - climate modeling: 7-8x speed-up, 5% of code modified
8000+ developers using OpenACC

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
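As a concrete instance of the pattern above, a SAXPY loop can be offloaded with a single OpenACC directive. This is an illustrative sketch (problem size and values are arbitrary), not code from the projects cited above:

#include <stdio.h>
#include <stdlib.h>

/* y = a*x + y, offloaded with OpenACC; the compiler generates the GPU kernel. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;                       /* illustrative problem size */
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 3.0f, x, y);                  /* build e.g. with: pgcc -acc saxpy.c */
    printf("y[0] = %f\n", y[0]);           /* expect 5.0 */

    free(x); free(y);
    return 0;
}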

MINIMAL EFFORT, BIG PERFORMANCE: LS-DALTON
Large-scale application for calculating high-accuracy molecular energies

Lines of code modified: <100
Weeks required: 1
Number of codes to maintain: 1 source

"OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation."
- Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University

[Chart: speedup vs. CPU for the LS-DALTON CCSD(T) module, benchmarked on the Titan supercomputer (AMD CPU vs. Tesla K20X), for Alanine-1 (13 atoms), Alanine-2 (23 atoms), and Alanine-3 (33 atoms)]

OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY
Paving the Path Forward: Single Code for All HPC Processors

[Chart: application speedup vs. a single CPU core for 359.miniGhost (Mantevo), NEMO (climate & ocean), and CloverLeaf (physics). The CPU-only MPI+OpenMP and MPI+OpenACC bars are nearly identical (4.1x vs. 4.3x, 5.2x vs. 5.3x, 7.1x vs. 7.1x), while the same OpenACC code running on CPU + GPU reaches 7.6x-30.3x.]

359.miniGhost - CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU
NEMO - CPU: Intel Xeon E5-2698 v3 per socket, 16 cores; GPU: NVIDIA K80, both GPUs
CloverLeaf - CPU: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs

INTRODUCING THE NEW OPENACC TOOLKIT
Free Toolkit Offers Simple & Powerful Path to Accelerated Computing

PGI Compiler: free OpenACC compiler for academia
NVProf Profiler: easily find where to add compiler directives
GPU Wizard: identify which GPU libraries can jumpstart your code
Code Samples: learn from examples of real-world algorithms
Documentation: quick start guide, best practices, forums

http://developer.nvidia.com/openacc

FREE OPENACC COURSES
Begin Accelerating Applications with OpenACC

DATE            COURSE                                           REGION
March 2016      Intro to Performance Portability with OpenACC    China
March 2016      Intro to Performance Portability with OpenACC    India
May 2016        Advanced OpenACC                                 Worldwide
September 2016  Intro to Performance Portability with OpenACC    Worldwide

Registration page: https://developer.nvidia.com/openacc-courses
Self-paced labs: http://nvidia.qwiklab.com

PROGRAMMING LANGUAGES

Numerical analytics: MATLAB, Mathematica, LabVIEW, Scilab, Octave
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++, KOKKOS, RAJA, HEMI, OCCA
Python: PyCUDA, Copperhead, Numba, NumbaPro
Java, C#: GPU.NET, Hybridizer (Altimesh), JCUDA, CUDA4J

COMPILE PYTHON FOR PARALLEL ARCHITECTURES
Anaconda Accelerate from Continuum Analytics

NumbaPro: array-oriented compiler for Python & NumPy
Compiles for CPUs or GPUs (uses LLVM + the NVIDIA Compiler SDK)
Fast development + fast execution: the ideal combination
Free academic license: http://continuum.io

MORE C++ PARALLEL FOR LOOPS
GPU Lambdas Enable Custom Parallel Programming Models

Kokkos (https://github.com/kokkos):

Kokkos::parallel_for(N, KOKKOS_LAMBDA (int i) {
    y[i] = a * x[i] + y[i];
});

RAJA (https://e-reports-ext.llnl.gov/pdf/782261.pdf):

RAJA::forall<cuda_exec>(0, N, [=] __device__ (int i) {
    y[i] = a * x[i] + y[i];
});

Hemi, a CUDA portability library (http://github.com/harrism/hemi):

hemi::parallel_for(0, N, [=] HEMI_LAMBDA (int i) {
    y[i] = a * x[i] + y[i];
});
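All three wrappers ultimately launch a CUDA kernel with a __device__ lambda. A minimal, self-contained sketch of that mechanism (not any of the libraries above; build with nvcc --expt-extended-lambda):

#include <cuda_runtime.h>
#include <cstdio>

// Generic kernel: each thread applies the user-supplied functor to one index.
template <typename F>
__global__ void forall_kernel(int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

// Tiny parallel_for wrapper in the spirit of Kokkos/RAJA/Hemi.
template <typename F>
void parallel_for(int n, F f) {
    int block = 256;
    int grid = (n + block - 1) / block;
    forall_kernel<<<grid, block>>>(n, f);
}

int main() {
    int n = 1 << 20;
    float a = 2.0f, *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 1.0f; }

    // __device__ lambda captured by value and executed on the GPU
    parallel_for(n, [=] __device__ (int i) { y[i] = a * x[i] + y[i]; });
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);   // expect 3.0
    cudaFree(x); cudaFree(y);
    return 0;
}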

THRUST LIBRARY
Programming with algorithms and policies today

Bundled with NVIDIA's CUDA Toolkit
Supports execution on GPUs and CPUs
Ongoing performance & feature improvements
Functionality beyond Parallel STL

[Chart: Thrust sort speedup, CUDA 7.0 vs. 6.5, 32M samples - 1.1x to 1.8x depending on key type (char, short, int, long, float, double)]

From the CUDA 7.0 Performance Report. Run on K40m, ECC ON, input and output data on device.
Performance may vary based on OS, software versions, and motherboard configuration.
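For reference, the Thrust style referred to here looks like the following minimal sketch (key type and element count are illustrative):

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // Fill a host vector with random keys, then move it to the GPU.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();
    thrust::device_vector<int> d = h;

    // Sort on the device; the same code targets CPU backends (OpenMP/TBB)
    // by switching Thrust's device system at compile time.
    thrust::sort(d.begin(), d.end());

    h = d;   // copy the sorted keys back
    printf("first=%d last=%d\n", h[0], h[h.size() - 1]);
    return 0;
}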

Portable, High-level Parallel Code TODAY

The Thrust library allows the same C++ code to target both:
NVIDIA GPUs
x86, ARM and POWER CPUs

Thrust was the inspiration for a proposal to the ISO C++ Committee;
the committee voted unanimously to accept it as an official technical specification working draft.

N3960 Technical Specification Working Draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype: https://github.com/n3554/n3554

STANDARDIZING PARALLEL STL
Technical Specification for C++ Extensions for Parallelism

Published as ISO/IEC TS 19570:2015, July 2015
Draft available online: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf
We've proposed adding this to C++17: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0024r0.html
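For reference, this is roughly the interface shape that later landed in C++17; a hedged sketch, not taken from the slides, assuming a standard library with <execution> support:

#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> v(1 << 20);
    std::iota(v.begin(), v.end(), 0.0);

    // Standard algorithms take an execution policy as their first argument;
    // std::execution::par asks the implementation to parallelize.
    std::sort(std::execution::par, v.begin(), v.end(), std::greater<>());
    double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);

    printf("v[0]=%f sum=%f\n", v[0], sum);
    return 0;
}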

CUDA
Super Simplified Memory Management Code

CPU Code:

void sortfile(FILE *fp, int N) {
  char *data;
  data = (char *)malloc(N);

  fread(data, 1, N, fp);

  qsort(data, N, 1, compare);

  use_data(data);

  free(data);
}

CUDA 6 Code with Unified Memory:

void sortfile(FILE *fp, int N) {
  char *data;
  cudaMallocManaged(&data, N);

  fread(data, 1, N, fp);

  qsort<<<...>>>(data, N, 1, compare);
  cudaDeviceSynchronize();

  use_data(data);

  cudaFree(data);
}

INTRODUCING NCCL ("NICKEL"):
ACCELERATED COLLECTIVES FOR MULTI-GPU SYSTEMS

INTRODUCING NCCL
Accelerating multi-GPU collective communications

GOAL:
Build a research library of accelerated collectives that is easily integrated and
topology-aware, so as to improve the scalability of multi-GPU applications

APPROACH:
Pattern the library after MPI's collectives
Handle the intra-node communication in an optimal way
Provide the necessary functionality for MPI to build on top of, to handle inter-node communication

NCCL FEATURES AND FUTURES
(Green = currently available)

Collectives:
Broadcast, All-Gather, Reduce, All-Reduce, Reduce-Scatter, Scatter, Gather, All-To-All, Neighborhood

Key features:
Single-node, up to 8 GPUs
Host-side API
Asynchronous/non-blocking interface
Multi-thread, multi-process support
In-place and out-of-place operation
Integration with MPI
Topology detection
NVLink & PCIe/QPI* support

NCCL IMPLEMENTATION

Implemented as monolithic CUDA C++ kernels combining the following:
GPUDirect P2P direct access
Three primitive operations: Copy, Reduce, ReduceAndCopy
Intra-kernel synchronization between GPUs
One CUDA thread block per ring direction

NCCL EXAMPLE
All-reduce

#include <nccl.h>

ncclComm_t comm[4];
int devs[4] = {0, 1, 2, 3};
ncclCommInitAll(comm, 4, devs);

foreach g in (GPUs) { // or foreach thread
  cudaSetDevice(g);
  double *d_send, *d_recv;
  // allocate d_send, d_recv; fill d_send with data
  ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
  // consume d_recv
}

NCCL PERFORMANCE

[Charts: bandwidth at different problem sizes on 4 Maxwell GPUs for Broadcast, All-Reduce, All-Gather, and Reduce-Scatter]

AVAILABLE NOW
github.com/NVIDIA/nccl

COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS

Libraries (AmgX, cuBLAS, ...), compiler directives, and programming languages share common programming models across x86 and other CPU architectures as well as GPUs

GPU DEVELOPER ECO-SYSTEM

Numerical packages: MATLAB, Mathematica, NI LabVIEW, PyCUDA
Debuggers & profilers: cuda-gdb, NVIDIA Visual Profiler, Parallel Nsight, Visual Studio, Allinea, TotalView
GPU compilers: C, C++, Fortran, Java, Python
Auto-parallelizing & cluster tools: OpenACC, mCUDA, OpenMP, Ocelot
Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib
Consultants & training: ANEO, GPU Tech
OEM solution providers

DEVELOP ON GEFORCE, DEPLOY ON TESLA

GeForce - designed for developers & gamers: available everywhere
https://developer.nvidia.com/cuda-gpus

Tesla - designed for the data center: ECC, 24x7 runtime, GPU monitoring, cluster management, GPUDirect-RDMA, Hyper-Q for MPI, 3-year warranty, integrated OEM systems, professional support

RESOURCES
Learn more about GPUs

CUDA resource center: http://docs.nvidia.com/cuda
GTC on-demand and webinars: http://on-demand-gtc.gputechconf.com | http://www.gputechconf.com/gtc-webinars
Parallel Forall blog: http://devblogs.nvidia.com/parallelforall
Self-paced labs: http://nvidia.qwiklab.com

TEGRA TX1


JETSON TX1 - KEY SPECS
Supercomputer on a module

GPU:        1 TFLOP/s 256-core Maxwell
CPU:        64-bit ARM A57 CPUs
Memory:     4 GB LPDDR4 | 25.6 GB/s
Storage:    16 GB eMMC
Wifi/BT:    802.11 2x2 ac / BT ready
Networking: 1 Gigabit Ethernet
Size:       50 mm x 87 mm
Interface:  400-pin board-to-board connector
Power:      Under 10 W for typical use cases

JETSON LINUX SDK

Graphics
Deep learning and computer vision
GPU compute
NVTX (NVIDIA Tools eXtension)
Debugger | Profiler | System Trace
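As an illustration of the NVTX piece, application code can push named ranges that then appear on the profiler / system-trace timeline. A minimal hedged sketch (the stage names and timings are invented; link with -lnvToolsExt):

#include <nvToolsExt.h>
#include <unistd.h>

/* Simulated stages of a vision pipeline, annotated for the profiler timeline. */
static void capture_frame()  { usleep(2000); }
static void run_inference()  { usleep(8000); }

int main(void) {
    for (int frame = 0; frame < 10; ++frame) {
        nvtxRangePushA("capture");     /* named range appears in the trace */
        capture_frame();
        nvtxRangePop();

        nvtxRangePushA("inference");
        run_inference();
        nvtxRangePop();
    }
    return 0;
}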

10X ENERGY EFFICIENCY FOR MACHINE LEARNING

[Chart: AlexNet efficiency in images/sec/Watt - Intel Core i7-6700K (Skylake) vs. Jetson TX1]

PATH TO AN AUTONOMOUS DRONE

                      TODAY'S DRONE (GPS-BASED)   CORE i7      JETSON TX1
Performance*          1x                          100x         100x
Power (compute)       2 W                         60 W         6 W
Power (mechanical)    70 W                        100 W        80 W
Flight time           20 minutes                  9 minutes    18 minutes

*Based on SGEMM performance

Comprehensive developer platform
http://developer.nvidia.com/embedded-computing

Jetson TX1 Developer Kit
$599 retail | $299 EDU
Pre-order Nov 12, shipping Nov 16 (US), international to follow

Jetson TX1 Module
$299 (1000-unit quantity), available 1Q16
Distributors worldwide

ONE ARCHITECTURE - END-TO-END AI

Tesla for cloud | Titan X for PC | DRIVE PX for auto | Jetson for embedded

FIVE THINGS TO REMEMBER

Time of accelerators has come
NVIDIA is focused on co-design from top to bottom
Accelerators are surging in supercomputing
Machine learning is the next killer application for HPC
Tesla platform leads in every way
