EFFICIENT DATA PARTITIONING FOR NEURAL NETWORKS
Graph source : Dean, Jeff, David Patterson, and Cliff Young. "A new golden age in computer architecture: Empowering the
machine-learning revolution." IEEE Micro 38.2 (2018): 21-29.
Neural network
• Driving factors for neural network optimization:
– Large models
– Huge amounts of data to process
• Key efficiency metrics:
– Compute:
• Parallelization to achieve efficient performance gains
– Memory:
• Reduce communication and synchronization overhead
Training vs Inference
Training is more challenging:
– Roughly 3 times more computations (forward plus backward passes)
– Increased storage requirements
– Large DNN models and datasets
Model parallelism
Each output channel (M) is independent.
Divide the CNN model (weights) among the processing units (see the sketch below).
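A minimal sketch of this output-channel partitioning; the layer shapes, device count, and numpy layout are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Hypothetical layer dimensions: M output channels, N input channels, KxK kernels.
M, N, K = 96, 3, 11
num_devices = 4

# Full weight tensor, laid out with the output-channel axis first.
weights = np.random.randn(M, N, K, K).astype(np.float32)

# Model parallelism: split the M output channels across the processing units.
# Each device then computes only its own slice of the output feature map.
per_device_weights = np.array_split(weights, num_devices, axis=0)

for dev, w in enumerate(per_device_weights):
    print(f"device {dev}: output channels {w.shape[0]}, weight bytes {w.nbytes}")
```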
Convolutional Neural Networks
Input feature map: H*W*N
Output feature map: H*W*M
Weight parameters: K²*N per output channel (K²*N*M total)
Number of computations: H*W*N*K²*M
(A worked example follows below.)
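A worked example of the weight and computation counts above, with illustrative layer dimensions (not values from the slides):

```python
# Conv layer dimensions (illustrative): HxW spatial size, N input channels,
# M output channels, KxK kernel, stride 1, 'same' padding.
H, W, N, M, K = 56, 56, 64, 128, 3

weight_params = K * K * N * M      # K^2 * N weights per output channel, M channels
macs = H * W * N * K * K * M       # H*W*N*K^2*M multiply-accumulates

print(f"weights: {weight_params:,}  MACs: {macs:,}")
# weights: 73,728  MACs: 231,211,008
```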
• Pros:
– More work for the GPU to occupy its resources.
• Cons:
– ... 64 batch size respectively.
– Even further batch size ...
[Plots against batch size (1-256 and 4-128)]
• Pros:
– Smaller weight matrix can be stored locally in shared memory.
– Reduced global memory traffic.
• Cons:
– Unable to fully occupy the GPU.
[Chart: number of accesses to shared memory, L1, L2, and device memory, model vs. data parallel; plot against batch size (1-256)]
• Significant number of output channels required to occupy the GPU.
• ... footprint.
[Plot against number of output channels (4-128); series: K40 - Data parallel, K40 - Model parallel, K40 - Hybrid]
Hybrid Approach
Execution Time vs. Batch Size
[Plots: execution time for 1 epoch (s) vs. batch size (1-256) on K40, GTX 1080Ti, and P100; series: data parallel, model parallel, and hybrid for each GPU]
The crossover point for AlexNet and ResNet is around a batch size of 64, but LeNet crosses over earlier because the former architectures have higher-dimensional channels.
Layer Fusion
• Even with the above optimizations, each layer has to store its output and the next layer has to load it back.
• We can further reduce this memory dependency by storing intermediate results locally.
• This is possible by partitioning the input feature space.
Layer Fusion
• A bigger tile in Layer 1 serves a smaller tile in Layer 2 to produce a smaller tile of the output feature map (see the sketch below).
• Causes re-computation due to tile overlap.
• Constraints:
– The intermediate tile should be small enough to fit in shared memory.
– The data-locality benefit should outweigh the amount of re-computation.
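A minimal numpy sketch of the fused tiling idea, assuming single-channel 3x3 convolutions with stride 1 and 'valid' borders for brevity (the slides assume padded, multi-channel feature maps); the S x S intermediate tile stays in a small local array standing in for shared memory instead of being written to a full intermediate feature map:

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive 'valid' 2-D convolution (single channel), used for both layers."""
    K = w.shape[0]
    out_h, out_w = x.shape[0] - K + 1, x.shape[1] - K + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w)
    return out

K = 3          # kernel size for both layers
S = 8          # intermediate-layer tile size (S x S); bounded by shared memory on a GPU
H = W = 32     # input feature map size (illustrative)

x  = np.random.randn(H, W).astype(np.float32)
w1 = np.random.randn(K, K).astype(np.float32)
w2 = np.random.randn(K, K).astype(np.float32)

out_tile = S - K + 1                 # each fused tile yields an (S-K+1) x (S-K+1) output tile
out_h = H - 2 * (K - 1)              # output size after two 'valid' convolutions
fused_out = np.zeros((out_h, out_h), dtype=np.float32)

for ti in range(0, out_h, out_tile):
    for tj in range(0, out_h, out_tile):
        # A bigger (S+K-1) x (S+K-1) input tile ...
        in_tile = x[ti:ti + S + K - 1, tj:tj + S + K - 1]
        # ... produces an S x S intermediate tile that never leaves "local" storage ...
        mid_tile = conv2d_valid(in_tile, w1)
        # ... which in turn produces a smaller tile of the final output feature map.
        fused_out[ti:ti + out_tile, tj:tj + out_tile] = conv2d_valid(mid_tile, w2)

# Reference: non-fused computation, writing the full intermediate feature map.
reference = conv2d_valid(conv2d_valid(x, w1), w2)
print(np.allclose(fused_out, reference, atol=1e-4))   # True
```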
Layer Fusion
• Counter-intuitively, the execution time increases by ~6x.
[Chart: fused vs. non-fused implementation execution time]
Layer Fusion – Memory dependency
[Figure: regular vs. fused implementation]
• The compute utilization is quite low for the fused implementation.
Layer Fusion
• The memory dependency causes a significant stall in the progress (~27%).
• Achieve ~2x speedup of ...
[Chart: normalized execution time; Layer 1, Layer 2]
Layer Fusion – Analytical Model
We developed an analytical model to calculate the additional computations and to determine when this tile-based approach is effective.
Say we have 'SHRD' bytes of shared memory. The intermediate layer's tile can only be as big as will fit in this shared memory.
For 'M' output channels in the intermediate layer:
SHRD = S² * M * 4
'SxS' is the intermediate layer's tile size,
'4' is the size of a float in bytes,
'M' is the number of channels in the intermediate layer.
(A worked example follows below.)
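A back-of-the-envelope check of this constraint, assuming a 48 KB shared-memory budget per block (an illustrative figure; the actual limit depends on the GPU):

```python
import math

SHRD = 48 * 1024          # assumed shared-memory budget in bytes (illustrative)
BYTES_PER_FLOAT = 4

for M in (16, 32, 64, 128):               # intermediate-layer channel counts
    # SHRD = S^2 * M * 4  =>  S = sqrt(SHRD / (4 * M))
    S = int(math.sqrt(SHRD / (BYTES_PER_FLOAT * M)))
    print(f"M = {M:3d}  ->  largest S x S intermediate tile: {S} x {S}")
```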
Layer Fusion – Analytical Model
Consider kernels for both layers of dimension [M][M][K][K], i.e., the number of input and output channels is equal, to keep the overall calculation simple.
For an SxS intermediate-layer tile, we get:
Tile size of the input layer: S+K-1
Tile size of the output layer: S-K+1
Assuming a stride of 1 and padding that keeps the input and output feature map dimensions the same, the input, intermediate, and output feature maps are all of dimension M x H x W.
Layer Fusion – Analytical Model
Thus, the number of computations
= (no. of computations in a tile) * (number of tiles)
Number of tiles = (H*W) / (S-K+1)²
As S >> K, i.e., the intermediate tile size is much larger than the kernel size, the additional computation tends to 0 (see the sketch below).
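A small sketch of the implied re-computation overhead, assuming the extra work comes from the S x S layer-1 tile being larger than the (S-K+1) x (S-K+1) output tile:

```python
# Fraction of extra layer-1 work caused by tile overlap, for a few tile sizes.
# Uses the tile relations from the slides (stride 1):
#   input tile = (S+K-1)^2, intermediate tile = S^2, output tile = (S-K+1)^2
K = 3
for S in (4, 8, 16, 32, 64):
    # Extra intermediate points recomputed, relative to the non-fused layer-1 work.
    overlap_ratio = S**2 / (S - K + 1) ** 2 - 1
    print(f"S = {S:3d}: each tile yields {S - K + 1}x{S - K + 1} outputs, "
          f"re-computation overhead ~{overlap_ratio:.1%} of layer-1 work")
# As S grows much larger than K, the overhead tends to 0.
```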
Layer Fusion for gradient update
• We applied the same fusion approach to the back-propagation gradient update.
• The backward pass accesses the weight matrix in transposed order, so we have non-unit-stride accesses to memory.
[Chart: BP fused vs. non-fused implementation execution time; Layer 1, Layer 2]
Layer Fusion for BP – transpose weight matrix
• The problem can be solved by taking a transpose of the weight matrix (sketch below).
• ... speedup of the fused layer over the non-fused version, and ~18% speedup overall.
[Chart: BP fused approach with weight transpose; time breakdown for Layer 1, Layer 2, and weight transpose, W/O fusion vs. with fusion]
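A minimal sketch of that weight rearrangement, assuming forward weights stored as [M][N][K][K] (shapes are illustrative, not from the slides); materialising the channel-transposed copy once lets the backward pass read the weights contiguously instead of with a stride:

```python
import numpy as np

# Forward weights: [M output channels][N input channels][K][K].
M, N, K = 64, 32, 3
w = np.random.randn(M, N, K, K).astype(np.float32)

# In the backward pass the roles of input and output channels swap, so the
# natural access pattern over w is strided. A contiguous transposed copy
# restores unit-stride reads. (The standard backward-data convolution also
# flips the K x K kernel spatially.)
w_bp = np.ascontiguousarray(w.transpose(1, 0, 2, 3))

print(w.shape, "->", w_bp.shape)   # (64, 32, 3, 3) -> (32, 64, 3, 3)
```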
Multi-GPU DNN training
• Neural networks grow deeper for accuracy
e.g., Resnet-152, Inception-v3, Faster R-CNN
• Large training dataset
e.g., ImageNet
• Single GPU takes several days/weeks to train the
network []
• Multi-GPU necessary to achieve faster training.
Multi-GPU training - Challenge
Communication and synchronization
• A data-parallel approach was chosen for distributing the data (features and network model) among multiple GPUs, as CNNs are the main focus of this project (see the sketch after this list).
• Evaluation platform:
– 4 GTX 1080Ti connected to two NUMA
nodes (2 GPUs/NUMA node)
– GPU0 -> GPU1 – P2P access enabled
– GPU0/1 -> GPU 2/3 – Resolves to
normal memcpy procedure
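A minimal numpy sketch of the data-parallel scheme assumed here (model replicated, batch split across devices, gradients averaged each step); a single linear layer stands in for the CNN, and the averaging step is where the P2P copies or host-staged memcpys described above would occur:

```python
import numpy as np

num_gpus = 4
batch, features = 128, 512
lr = 0.01

rng = np.random.default_rng(0)
x = rng.standard_normal((batch, features)).astype(np.float32)
y = rng.standard_normal((batch, 1)).astype(np.float32)

# Replicated model parameters (a single linear layer stands in for the CNN).
w = np.zeros((features, 1), dtype=np.float32)

# Data parallelism: each "GPU" gets 1/num_gpus of the batch.
x_shards = np.array_split(x, num_gpus)
y_shards = np.array_split(y, num_gpus)

# Each replica computes a local gradient on its shard (squared-error loss).
local_grads = []
for xs, ys in zip(x_shards, y_shards):
    err = xs @ w - ys
    local_grads.append(xs.T @ err / len(xs))

# Communication step: average the gradients across replicas (all-reduce),
# then every replica applies the same update.
grad = np.mean(local_grads, axis=0)
w -= lr * grad
print("gradient norm after all-reduce:", float(np.linalg.norm(grad)))
```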
Multi-GPU training - Analysis
• AlexNet chosen as the evaluation workload, as it contains most of the commonly used layers and covers reasonable configuration ranges.
• Communication overhead on total ...
• Increasing the number of GPUs -> increasing overhead; this also involves communication between the 2 NUMA nodes via SMP.
[Chart: multi-GPU communication, percentage overhead (0-8%) vs. batch size (16-128); AlexNet - 2 GPUs vs. AlexNet - 4 GPUs]