
GPU in DNN

By Abdul Karim

Principal Supervisor: Professor Abdul Sattar


Associate Supervisor: MAHakim Newton
Why do we need specialized hardware for training Deep Neural Networks?

When you train a deep learning model, two main operations are performed:

• Forward Pass
• Backward Pass

Both involve matrix multiplication. This seems to be a very simple task, but real-world data normally has hundreds or thousands of dimensions/parameters.
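As a minimal sketch of the forward pass reducing to matrix multiplication (pure Python; the layer sizes and weight values here are made up for illustration):

```python
import math

def matmul(W, X):
    """Multiply an MxN weight matrix by an NxL activation matrix."""
    M, N, L = len(W), len(X), len(X[0])
    return [[sum(W[i][k] * X[k][j] for k in range(N)) for j in range(L)]
            for i in range(M)]

def sigmoid(Z):
    """Apply the logistic activation element-wise."""
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in Z]

# Toy forward pass: 4 input features, 3 hidden units, batch of 2 examples.
X = [[0.1, 0.5], [0.2, 0.1], [0.3, 0.9], [0.4, 0.0]]   # 4x2 input batch
W1 = [[0.01] * 4 for _ in range(3)]                    # 3x4 weight matrix
A1 = sigmoid(matmul(W1, X))                            # 3x2 hidden activations
```

The backward pass runs the same kind of matrix products in reverse to propagate the error back to the weights.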
Motivation

What does it take to reach human-level performance with a machine-learning algorithm?

• A huge data set
• An appropriate ML algorithm

Problem: my camera should identify each and every scene it sees, like the human eye. To achieve that, I should train my deep Convolutional Neural Network with millions of images.


A real-world example: Places
http://places.csail.mit.edu/
A new scene-centric database called Places

• A repository of 10 million scene photographs, labeled with scene semantic categories.

• Each image is of shape (64, 64, 3), where 3 is for the 3 channels (RGB), giving 64 × 64 × 3 = 12288 values per image.

Size of our training data set:

10 million x 12288
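A quick back-of-the-envelope check of these numbers (pure Python; the 4-bytes-per-value assumption is mine, for 32-bit floats):

```python
pixels_per_image = 64 * 64 * 3       # flattened length of one (64, 64, 3) image
num_images = 10_000_000

total_values = pixels_per_image * num_images
bytes_needed = total_values * 4       # assuming float32 storage

print(pixels_per_image)               # 12288
print(total_values)                   # 122880000000
print(bytes_needed / 1e12)            # ~0.49 TB for the raw training matrix
```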
Forward pass: A Reasonable Deep Neural Network

[Figure: fully connected network with an input layer of 12288 units, three hidden layers of 10 units each, and one output unit feeding the error. Weight matrices: 10×12288, 10×10, 10×10, 1×10. Batch activations: 12288×10 million at the input, 10×10 million after each hidden layer, 1×10 million at the output.]
Backward pass (weight update): A Reasonable Deep Neural Network

[Figure: the same network, with the input 12288×10 million, hidden activations 10×10 million, output 1×10 million, traversed in reverse so the error propagates back through the 1×10, 10×10, 10×10, and 10×12288 weight matrices.]
Repeat until convergence
{
   Forward propagation to calculate the error (the cost or objective function).
   Update the weights through backward propagation.
}
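The loop above, sketched in pure Python for a single weight and a squared loss (the data point and learning rate here are made-up illustrative choices):

```python
# Fit y = w * x to one data point by repeated forward/backward passes.
x, y_true = 2.0, 8.0        # toy data: the true w is 4
w = 0.0                     # initial weight
lr = 0.1                    # learning rate (illustrative choice)

for _ in range(100):
    y_pred = w * x                       # forward pass
    error = 0.5 * (y_pred - y_true)**2   # cost (squared loss)
    grad = (y_pred - y_true) * x         # backward pass: dError/dw
    w = w - lr * grad                    # weight update

print(round(w, 4))   # converges to ~4.0
```

Here "convergence" means the error stops decreasing appreciably between iterations; each update is a small step from the previous weight.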
 Does the weight update have the info, or any imprints, of the previous weights?

 How is the optimization in the first iteration related to the immediately following iteration?

 Is there an optimum error that is related to the number of iterations? What does convergence mean?

Example of human-level error for the same task:

[Figure: Error vs. Number of iterations. The DL Model's error curve decreases with iterations toward the Human-level error, which in turn sits above the Bayes Error Rate.]

<<Deep Learning: Ian Goodfellow, Yoshua Bengio and Aaron Courville>>

 Bayes error rate is the lowest possible error rate for any classifier of a random outcome.


 Does the weight update have the info, or any imprints, of the previous weights?
OR
 How is the optimization in the first iteration related to the immediately following iteration?

Weight update: yes, each weight w is incremented or decremented by a factor from its previous value.

 In each subsequent iteration, the weights step up or step down a little baby step from the previous weights.
 Not all the weights are updated, but only those whose gradient is non-zero; the gradient, however, is calculated in every iteration.
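A minimal sketch of that selective update (pure Python; the weight and gradient values are made-up numbers):

```python
# Only weights whose gradient is non-zero actually change,
# but every gradient still has to be computed each iteration.
weights = [0.5, -0.2, 0.8, 0.1]      # made-up current weights
grads   = [0.3,  0.0, -0.1, 0.0]     # made-up gradients from backprop
lr = 0.01                            # learning rate

updated = [w - lr * g for w, g in zip(weights, grads)]
print(updated)   # weights paired with a zero gradient are unchanged
```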

Our idea
 Somehow we won’t need to calculate this term. Instead, we will compute another term which reflects the variation in the input data.
 In each layer of the DNN, there is a multiplication of matrices whose sizes run into the millions.

[Figure: the same network, showing the per-layer matrix multiplications: weight matrices 10×12288, 10×10, 10×10, 1×10 against batch activations 12288×10 million, 10×10 million, 10×10 million, 1×10 million, ending in the error.]
 CPU is latency optimized ………. a Ferrari: best at moving small loads (numbers: a × b × c, with scalars a, b, c, …) very fast.

 GPU is bandwidth optimized ……….. a truck: best at moving big loads (matrices: A × B × C, with matrices A, B, C, …) in bulk.

The best CPUs have about 50 GB/s memory bandwidth while the best GPUs have 750 GB/s, so for large chunks of memory GPUs provide the best memory bandwidth while having almost no drawback due to latency: thread parallelism effectively hides the latency, letting GPUs offer high bandwidth.
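To make those bandwidth figures concrete, a rough transfer-time estimate for one batch matrix (pure Python; the 50 GB/s and 750 GB/s numbers are from the slide, the float32 assumption is mine):

```python
# Time to stream the 12288 x 10-million input matrix through memory.
matrix_bytes = 12288 * 10_000_000 * 4   # float32: ~491.5 GB

cpu_bandwidth = 50e9    # bytes/s, "best CPU" figure from the slide
gpu_bandwidth = 750e9   # bytes/s, "best GPU" figure from the slide

print(matrix_bytes / cpu_bandwidth)   # ~9.83 s per pass on the CPU
print(matrix_bytes / gpu_bandwidth)   # ~0.66 s per pass on the GPU
```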
Why do we need a GPU? How specifically does it help in matrix multiplication and CPU-cycle usage?

What if we do the same task using a CPU?

Is there anything specific to back-propagation which, if eliminated, can reduce the training time? Give an example on a sample dataset.

An algorithm to do the above task.

Improvement on a sample data set.

Performance on a real-world data set, in terms of CPU-cycle or GPU-cycle improvement.
The only deep learning library which currently implements efficient algorithms across GPUs and across
computers is CNTK which uses Microsoft’s special parallelization algorithms of 1-bit quantization
(efficient) and block momentum (very efficient).

A simple way to understand the difference between a GPU and a CPU is to compare how they
process tasks. A CPU consists of a few cores optimized for sequential serial processing while a
GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores
designed for handling multiple tasks simultaneously.

• We need to write an interface program to deploy our deep learning NN on a GPU.

• Deep learning frameworks like TensorFlow can do it for us.
Select a suitable framework

Select a GPU
NVIDIA GPUs for deep learning are available in desktops, notebooks, servers, and supercomputers around the world, as well as in cloud services from Amazon, IBM, Microsoft, and Google.

NVIDIA® DGX™ SYSTEMS
NVIDIA TITAN Xp: 3840 NVIDIA® CUDA® cores running at 1.6 GHz; 12 TFLOPs of brute force
NVIDIA Quadro® GP100

Definitions of FLOP

1) The simple plural of “FLOP” (i.e. “operation X takes 50 FLOPs”)

2) The rate of FLOPs in the first sense (i.e. floating-point math operations per second)

 Let's assume that one FLOP is required to perform an addition, multiplication, division, or exponential. (This is the minimum possible, to establish a minimum criterion.)

In a real-world scenario, multiplication is more expensive than addition, and division and exponential are more expensive still, respectively.

<http://www.latkin.org/blog/2014/11/09/a-simple-benchmark-of-various-math-operations/>
FLOPS required for Matrix operations

[1] G. H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1991.
[2] Kh.D. Ikramov and N.V. Savel’eva, “Conditionally Definite Matrices,” Journal of Mathematical
Sciences, vol. 98, no. 1, pp. 1–50, 2000.
Forward pass: A Reasonable Deep Neural Network

FLOP accounting:
Addition = 1 FLOP
Multiplication = 1 FLOP
Division = 1 FLOP
3 FLOPs per sigmoid

[Figure: the same network, with weight matrices 10×12288, 10×10, 10×10, 1×10 and batch activations 12288×10 million, 10×10 million, 10×10 million, 1×10 million, ending in the error.]
Forward pass, layer 1: (10×12288) · (12288×10 million) → 10×10 million

FLOPs for an (M×N) · (N×L) product: (2N − 1) · M · L
(2 · 12288 − 1) · 10 · 10 million = 2.4575 × 10^12 FLOPs

Sigmoid on the 10×10 million result: total operations in sigmoid = 3, FLOPs per operation = 1 → 300 × 10^6 FLOPs
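These counts can be reproduced with a small helper (pure Python; the (2N − 1) · M · L formula is the standard dense matrix-multiply count and matches the slide's figures):

```python
def matmul_flops(M, N, L):
    """FLOPs for an (MxN)·(NxL) product: N multiplies + (N-1) adds per entry."""
    return (2 * N - 1) * M * L

def sigmoid_flops(M, L):
    """3 FLOPs (exponential, addition, division) per element of an MxL matrix."""
    return 3 * M * L

batch = 10_000_000
print(matmul_flops(10, 12288, batch))   # 2457500000000  (2.4575e12, layer 1)
print(matmul_flops(10, 10, batch))      # 1900000000     (1.9e9, hidden layer)
print(matmul_flops(1, 10, batch))       # 190000000      (190e6, output layer)
print(sigmoid_flops(10, batch))         # 300000000      (300e6 per hidden layer)
```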
Forward pass, layer 2: (10×10) · (10×10 million) → 10×10 million

FLOPs: (2 · 10 − 1) · 10 · 10 million = 1.9 × 10^9

Sigmoid on the 10×10 million result: total operations in sigmoid = 3, FLOPs per operation = 1 → 300 × 10^6 FLOPs
Forward pass, layer 3: (10×10) · (10×10 million) → 10×10 million

FLOPs: (2 · 10 − 1) · 10 · 10 million = 1.9 × 10^9

Sigmoid on the 10×10 million result: total operations in sigmoid = 3, FLOPs per operation = 1 → 300 × 10^6 FLOPs
Forward pass, output layer: (1×10) · (10×10 million) → 1×10 million

FLOPs: (2 · 10 − 1) · 1 · 10 million = 190 × 10^6

Sigmoid on the 1×10 million result: total operations in sigmoid = 3, FLOPs per operation = 1 → 30 × 10^6 FLOPs
Predictions (1×10 million), e.g.: 0.3 0.6 0.9 0.9 0.1 0.01 0.3 0.4 0.99 0.001
Labels (1×10 million), e.g.: 1 0 1 1 1 1 0 1 0 0

Simplest form of error (a 1×10 million comparison reduced to a 1×1 value):
Number of subtractions = 10 million
Number of squares = 10 million
Number of additions = 10 million

Total FLOPs in calculating the error = 30 × 10^6

 We have selected square loss as our cost function and sigmoid as our unit activation function.
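A sketch of that error computation with explicit operation counting (pure Python, on a tiny stand-in for the 10-million-wide vectors):

```python
def square_loss_with_flops(preds, labels):
    """Sum of squared differences, plus the FLOP count (3 per output)."""
    n = len(preds)
    diffs   = [p - y for p, y in zip(preds, labels)]   # n subtractions
    squares = [d * d for d in diffs]                   # n multiplications
    total   = sum(squares)                             # n additions
    return total, 3 * n

preds  = [0.3, 0.6, 0.9, 0.9]
labels = [1, 0, 1, 1]
loss, flops = square_loss_with_flops(preds, labels)
print(flops)    # 12 FLOPs for 4 outputs; 30e6 for the full 10 million
```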

Total FLOPs in calculating the gradients, by the chain rule: 300.00020999999 × 10^12 FLOPs, composed of 100.00001 × 10^12 FLOPs, 10 million FLOPs, 199.99999 × 10^12 FLOPs, and 199.99999 × 10^6 FLOPs for the individual terms.
General procedure for obtaining gradients in layers (L−1) to 1

[Figure: the same network; weight matrices 10×12288, 10×10, 10×10, 1×10, batch activations 12288×10 million down to 1×10 million.]

FLOP counts for the intermediate matrix products (shapes include 10 million×10 times 10×10 million, 10×10 million times 10 million×10 million, and 10×10 million times 10 million×10):
• 1.9999999 × 10^15 FLOPs
• 1.9 × 10^15 FLOPs
• 1.9999999 × 10^9 FLOPs
• 1.9000001 × 10^15 FLOPs

FLOPs for the gradient = 3.9000020999999 × 10^15


Gradients in the next layer (same network):

FLOP counts for the intermediate matrix products:
• 1.9 × 10^9 FLOPs
• 1.9999999 × 10^15 FLOPs
• 1.9 × 10^15 FLOPs
• 1.9999999 × 10^9 FLOPs
• 1.9000001 × 10^15 FLOPs

FLOPs for the gradient = 3.9000038999999 × 10^15


Gradients down to the first layer (same network):

FLOP counts for the intermediate matrix products:
• 1.9 × 10^9 FLOPs
• 1.9999999 × 10^15 FLOPs
• 1.9 × 10^15 FLOPs
• (10×10 million) · (10 million×12288): (2 · 10 million − 1) · 10 · 12288 = 2.45759987712 × 10^12 FLOPs
• 1.9000001 × 10^15 FLOPs

FLOPs for the gradient = 3.90245949987710 × 10^15


Number of FLOPs required for the weight updates, per layer:

• 10×12288: additions = 122880, multiplications = 122880, total FLOPs for update = 245760
• 10×10: additions = 100, multiplications = 100, total FLOPs for update = 200
• 10×10: additions = 100, multiplications = 100, total FLOPs for update = 200
• 1×10: additions = 10, multiplications = 10, total FLOPs for update = 20

Total FLOPs required for one iteration =
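The per-layer update counts follow from one multiplication (learning rate × gradient) and one addition per weight; a quick check in pure Python:

```python
def update_flops(rows, cols):
    """w := w - lr * grad costs one multiply and one add per weight."""
    return 2 * rows * cols

for shape in [(10, 12288), (10, 10), (10, 10), (1, 10)]:
    print(shape, update_flops(*shape))
# (10, 12288) -> 245760; (10, 10) -> 200; (1, 10) -> 20
```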


DIGITS: Deep Learning GPU Training System
Open Source:

https://devblogs.nvidia.com/parallelforall/digits-deep-learning-gpu-training-system/
• In a recent talk at Center for Brains, Minds and Machines at MIT, I
heard Professor Josh Tenenbaum mention (someone else’s quote) -
“Deep Learning works very well in problems where there is a
repetitive structure in space or time”
For i = 1 to 10 million:

[Figure: the same network applied to a single example: input 12288×1, hidden activations 10×1 at each of the three hidden layers, output 1×1; weight matrices 10×12288, 10×10, 10×10, 1×10.]
Steps for DNN Experiments

https://www.nvidia.com/en-us/deep-learning-ai/developer/