Production Deep Learning with NVIDIA GPU Inference Engine
Figure 1. NVIDIA GPU Inference Engine (GIE) provides
even higher efficiency and performance for neural
network inference. Tests performed using GoogLeNet.
CPU-only: Single-socket Intel Xeon (Haswell) E5-2698
v3 @ 2.3GHz with HT.
GPU: NVIDIA Tesla M4 + cuDNN 5 RC.
GPU + GIE: NVIDIA Tesla M4 + GIE.
[Update September 13, 2016: GPU Inference Engine is now TensorRT (https://developer.nvidia.com/tensorrt)]
Today at ICML 2016, NVIDIA announced its latest Deep Learning SDK updates, including DIGITS 4 (https://developer.nvidia.com/digits), cuDNN
(https://developer.nvidia.com/cudnn) 5.1 (CUDA Deep Neural Network Library) and the new GPU Inference Engine
(https://developer.nvidia.com/gpuinferenceengine).
NVIDIA GPU Inference Engine (GIE) is a high-performance deep learning inference solution for production environments. Power efficiency and speed
of response are two key metrics for deployed deep learning (https://developer.nvidia.com/deeplearning) applications, because they directly
affect the user experience and the cost of the service provided. GIE automatically optimizes trained neural networks for runtime performance,
delivering up to 16x higher performance per watt on a Tesla M4 GPU compared to the CPU-only systems commonly used for inference today.
Figure 1 shows GIE inference performance per watt of the relatively complex GoogLeNet running on a Tesla M4. GIE can deliver 20 Images/s/Watt
on the simpler AlexNet benchmark.
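As a minimal sketch of the metric used above, performance per watt is simply inference throughput divided by average power draw. The function names below are hypothetical helpers, not part of GIE, and any numbers you pass in should come from your own measurements.

```cpp
#include <cassert>

// Power efficiency as used in this post: inference throughput divided by
// average power draw, expressed in images per second per watt.
double imagesPerSecondPerWatt(double imagesPerSecond, double watts) {
    assert(watts > 0.0);
    return imagesPerSecond / watts;
}

// Relative efficiency of platform A over platform B (for example, "16x").
double efficiencyRatio(double perfPerWattA, double perfPerWattB) {
    assert(perfPerWattB > 0.0);
    return perfPerWattA / perfPerWattB;
}
```

For illustration only: a device that sustains 1000 images/s while drawing 50 W (made-up figures, not measurements from this post) would score 20 images/s/W under this definition.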
In this post, we will discuss how you can use GIE to get the best efficiency and performance out of your trained deep neural network on a
GPU-based deployment platform.
Solving a supervised machine learning problem with deep neural networks involves a two-step process.
1. The first step is to train a deep neural network on massive amounts of labeled data using GPUs. During this step, the neural network
learns millions of weights or parameters that enable it to map input data examples to correct responses. Training requires iterative
forward and backward passes through the network as the objective function is minimized with respect to the network weights. Often
several models are trained and accuracy is validated against data not seen during training in order to estimate real-world performance.
2. The next step, inference, uses the trained model to make predictions from new data. During this step, the best trained model is used in
an application running in a production environment such as a data center, an automobile, or an embedded platform. For some
applications, such as autonomous driving, inference is done in real time and therefore high throughput is critical.
To learn more about the differences between training and inference, see Michael Andersch's post on inference with GPUs
(https://devblogs.nvidia.com/parallelforall/inferencenextstepgpuaccelerateddeeplearning/).
The target deployment environment introduces various challenges that are typically not present in the training environment. For example, if the
target is an embedded device using the trained neural network to perceive its surroundings, then the forward inference pass through the model has
a direct impact on the overall response time and the power consumed by the device. The key metric to optimize is power efficiency: the inference
performance per watt.
Performance per watt is also the critical metric in maximizing data center operational efficiency. In this scenario, the need to minimize latency
and energy used on large volumes of geographically and temporally disparate requests limits the ability to form large batches.
There are two phases in the use of GIE: build and deployment (see Figure 2). In the build phase, GIE performs optimizations on the network
configuration and generates an optimized plan for computing the forward pass through the deep neural network. The plan is an optimized object
code that can be serialized and stored in memory or on disk.
The deployment phase generally takes the form of a long-running service or user application that accepts batches of input data, performs inference
by executing the plan on the input data and returns batches of output data (classification, object detection, etc.). With GIE you don't need to install
and run a deep learning framework on the deployment hardware. Discussion of the batching and pipeline of the inference service is a topic for
another post; instead we will focus on how to use GIE for inference.
The GIE runtime needs three files to deploy a classification neural network:
In addition, you must define the batch size and the output layer. Code Listing 1 illustrates how to convert a Caffe model to a GIE object. The
builder (lines 4-7) is responsible for reading the network information. Alternatively, you can use the builder to define the network information if
you don't provide a network architecture file (deploy.prototxt).
GIE supports the following layer types.
Convolution: 2D
Activation: ReLU, tanh and sigmoid
Pooling: max and average
ElementWise: sum, product or max of two tensors
LRN: cross-channel only
Fully-connected: with or without bias
SoftMax: cross-channel only
Deconvolution
You can also use the GIE C++ API to define the network without the Caffe parser, as Listing 2 shows. You can use the API to define any supported
layer and its parameters. You can define any parameter that varies between networks, including convolution layer weight dimensions and outputs
as well as the window size and stride for pooling layers.
After defining or loading the network, you must specify the output tensors as line 13 of Listing 1 shows; in our example the output is prob (for
probability). Next, define the batch size (line 16), which can vary depending on the deployment scenario. Listing 1 uses a batch size of 1 but you
may choose larger batch sizes to fit your application needs and system configuration. Underneath, GIE performs layer optimizations to reduce
inference time. While this is transparent to the API user, analyzing the network layers requires memory, so you must specify the maximum
workspace size (line 17).
The last step is to call buildCudaEngine to perform layer optimization and build the engine with the optimized network based on your provided
inputs and parameters. Once the model is converted to a GIE object it is deployable and can either be used on the host device or saved and used
elsewhere.
GIE performs several important transformations and optimizations to the neural network graph. First, layers with unused output are eliminated to
avoid unnecessary computation. Next, where possible convolution, bias, and ReLU layers are fused to form a single layer. Figure 4 shows the result
of this vertical layer fusion on the original network from Figure 3 (fused layers are labeled CBR in Figure 4). Layer fusion improves the efficiency of
running GIE-optimized networks on the GPU.
Figure 3. An example convolutional neural network with multiple convolutional and activation layers.
Figure 4. An example of vertical layer fusion on a convolutional neural network. Here, convolutional layers are combined with subsequent bias and activation (ReLU) layers.
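To make the idea of vertical fusion concrete, here is a small CPU-only sketch, not GIE's implementation. It applies a 1D convolution (cross-correlation, as deep learning frameworks compute it), a scalar bias, and ReLU first as three separate passes and then as one fused "CBR" pass. The function names and the 1D simplification are assumptions made for illustration; the point is only that the fused version produces the same result while never materializing the intermediate tensors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: three passes, each writing a full intermediate tensor.
std::vector<float> convBiasReluUnfused(const std::vector<float>& in,
                                       const std::vector<float>& kernel,
                                       float bias) {
    assert(!kernel.empty() && in.size() >= kernel.size());
    const std::size_t n = in.size() - kernel.size() + 1;

    std::vector<float> conv(n, 0.0f);            // pass 1: convolution
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < kernel.size(); ++k)
            conv[i] += in[i + k] * kernel[k];

    std::vector<float> biased(n);                // pass 2: bias
    for (std::size_t i = 0; i < n; ++i) biased[i] = conv[i] + bias;

    std::vector<float> out(n);                   // pass 3: ReLU
    for (std::size_t i = 0; i < n; ++i) out[i] = std::max(biased[i], 0.0f);
    return out;
}

// Fused "CBR": one pass, no intermediate tensors written to memory.
std::vector<float> convBiasReluFused(const std::vector<float>& in,
                                     const std::vector<float>& kernel,
                                     float bias) {
    assert(!kernel.empty() && in.size() >= kernel.size());
    const std::size_t n = in.size() - kernel.size() + 1;

    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < kernel.size(); ++k)
            acc += in[i + k] * kernel[k];
        out[i] = std::max(acc + bias, 0.0f);     // bias and ReLU applied in place
    }
    return out;
}
```

On a GPU the saving comes from avoiding extra kernel launches and round trips to memory for the intermediate results; this sketch only mirrors that structure on the CPU.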
Another transformation is horizontal layer fusion, or layer aggregation, along with the required division of aggregated layers to their respective
outputs, as Figure 5 shows. Horizontal layer fusion improves performance by combining layers that take the same source tensor and apply the same
operations with similar parameters, resulting in a single larger layer for higher computational efficiency. The example in Figure 5 shows the
combination of 3 1×1 CBR layers from Figure 4 that take the same input into a single larger 1×1 CBR layer. Note that the output of this layer must
be disaggregated to feed into the different subsequent layers from the original input graph.
Figure 5. An example of horizontal layer fusion on a convolutional neural network. Here, multiple 1×1 CBR layers from Figure 4 are fused horizontally, or across similar layers
in the graph that share the same input.
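The aggregation and disaggregation described above can be sketched on the CPU as follows. At a single pixel, a 1×1 convolution is a matrix-vector product over the input channels, so layers that share an input can be fused by stacking their weight matrices and splitting the result afterwards. This is a simplified illustration under that assumption (bias and ReLU are omitted for brevity), with hypothetical helper names; it is not how GIE implements fusion internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One row per output channel, one column per input channel.
using Matrix = std::vector<std::vector<float>>;

// A 1x1 convolution evaluated at one pixel: y = W * x.
std::vector<float> matVec(const Matrix& w, const std::vector<float>& x) {
    std::vector<float> y(w.size(), 0.0f);
    for (std::size_t r = 0; r < w.size(); ++r) {
        assert(w[r].size() == x.size());
        for (std::size_t c = 0; c < x.size(); ++c) y[r] += w[r][c] * x[c];
    }
    return y;
}

// Aggregate: stack the weights of layers that read the same input tensor.
Matrix fuseHorizontally(const std::vector<Matrix>& layers) {
    Matrix fused;
    for (const Matrix& w : layers) fused.insert(fused.end(), w.begin(), w.end());
    return fused;
}

// Disaggregate: slice the fused output back into per-layer outputs.
std::vector<std::vector<float>> splitOutputs(
        const std::vector<float>& fusedOut,
        const std::vector<std::size_t>& channelsPerLayer) {
    std::vector<std::vector<float>> outs;
    std::size_t offset = 0;
    for (std::size_t n : channelsPerLayer) {
        assert(offset + n <= fusedOut.size());
        const auto first = fusedOut.begin() + static_cast<std::ptrdiff_t>(offset);
        outs.emplace_back(first, first + static_cast<std::ptrdiff_t>(n));
        offset += n;
    }
    return outs;
}
```

Calling `matVec(fuseHorizontally({a, b}), x)` once and then `splitOutputs` yields the same values as calling `matVec(a, x)` and `matVec(b, x)` separately, but the input is read in a single pass.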
GIE performs its transformations during the build phase transparently to the API user after the GIE parser reads in the trained network and
configuration file, as Listing 1 shows.
The inference builder (IBuilder) buildCudaEngine method returns a pointer to a new inference engine runtime object (ICudaEngine). This
runtime object is ready for immediate use; alternatively, its state can be serialized and saved to disk or to an object store for distribution. The
serialized object code is called the Plan.
As mentioned earlier, the full scope of batching and streaming data to and from the runtime inference engine is beyond the scope of this article.
Listing 3 demonstrates the steps required to use the inference engine to process a batch of input data to generate a result.
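Listing 3
// The execution context is responsible for launching the
// compute kernels
IExecutionContext* context = engine->createExecutionContext();

// In order to bind the buffers, we need to know the names of the
// input and output tensors.
int inputIndex = engine->getBindingIndex(INPUT_LAYER_NAME),
    outputIndex = engine->getBindingIndex(OUTPUT_LAYER_NAME);

// Allocate GPU memory for Input / Output data.
// buffers is an array of device pointers, so it is declared as void**.
void** buffers = (void**)malloc(engine->getNbBindings() * sizeof(void*));
cudaMalloc(&buffers[inputIndex], batchSize * size_of_single_input);
cudaMalloc(&buffers[outputIndex], batchSize * size_of_single_output);

// Use CUDA streams to manage the concurrency of copying and executing
cudaStream_t stream;
cudaStreamCreate(&stream);

// Copy Input Data to the GPU
cudaMemcpyAsync(buffers[inputIndex], input,
                batchSize * size_of_single_input,
                cudaMemcpyHostToDevice, stream);

// Launch an instance of the GIE compute kernel
// (context is a pointer, so it is dereferenced with ->)
context->enqueue(batchSize, buffers, stream, nullptr);

// Copy Output Data to the Host
// Assumption: output is a host buffer of batchSize * size_of_single_output
// bytes; this copy mirrors the input copy above.
cudaMemcpyAsync(output, buffers[outputIndex],
                batchSize * size_of_single_output,
                cudaMemcpyDeviceToHost, stream);

// Wait for the stream to finish before reading the results on the host
cudaStreamSynchronize(stream);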
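Listing 3 leaves `size_of_single_input` and `size_of_single_output` undefined. One plausible way to compute them, assuming each sample is a dense channels × height × width tensor of 32-bit floats, is sketched below. The `TensorDims` struct and helper names are hypothetical and not part of the GIE API.

```cpp
#include <cassert>
#include <cstddef>

// Shape of one sample, assumed to be stored densely as channels x height x width.
struct TensorDims {
    std::size_t channels;
    std::size_t height;
    std::size_t width;
};

// Bytes needed for a single sample with the given element size.
std::size_t bytesPerSample(const TensorDims& d, std::size_t bytesPerElement) {
    return d.channels * d.height * d.width * bytesPerElement;
}

// Bytes needed for a whole batch, i.e. the size passed to cudaMalloc
// and cudaMemcpyAsync in Listing 3.
std::size_t batchBytes(std::size_t batchSize, const TensorDims& d,
                       std::size_t bytesPerElement) {
    assert(batchSize > 0);
    return batchSize * bytesPerSample(d, bytesPerElement);
}
```

For example, under the assumption of a 3 × 224 × 224 float input and a 1000-class float output (typical for ImageNet classifiers, but check your own deploy.prototxt), a single input occupies 602,112 bytes and a single output 4,000 bytes.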
GIE Performance
At the end of the day, the success of GIE comes down to the performance it provides for inference. To measure the performance benefits we
compared the per-layer timings of the GoogLeNet network using Caffe and GIE on NVIDIA Tesla M4 GPUs with a batch size of 1 averaged over 1000
iterations with GPU clocks fixed in the P0 state.
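If you want to reproduce this kind of averaged measurement for your own network, a minimal host-side timing harness might look like the sketch below. It is not the harness used for Figure 6; the function name and the warm-up iterations are assumptions added for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Returns the mean wall-clock time per call of `work`, in milliseconds,
// averaged over `iterations` calls after `warmup` untimed calls.
double averageMillis(const std::function<void()>& work,
                     int iterations, int warmup = 10) {
    assert(iterations > 0 && warmup >= 0);
    for (int i = 0; i < warmup; ++i) work();   // untimed warm-up runs

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) work();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

When timing GPU work this way, the callable should end with a synchronization point such as `cudaStreamSynchronize(stream)`; otherwise only the asynchronous launch is measured rather than the execution itself.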
Figure 6. GIE + GPU vs. Caffe + GPU GoogLeNet layer execution time (lower is better).
The bar graph (Figure 6) is sorted to show the 10 most computationally expensive GoogLeNet layers (as run by Caffe) ordered from left to right as
light green bars. The dark green bars represent the same layers run using GIE (lower is better). Since GIE can combine layers both vertically and
horizontally into a single optimized kernel, the Caffe timing shown for each bar is the sum of Caffe kernels corresponding to each fused GIE kernel,
while the GIE timing for each bar is for a single fused and optimized kernel.
Bars with two layers correspond to vertically fused layers, namely CBR (convolution + bias + activation/ReLU). Bars with four layers (or whose
names contain || separating two CBR layer names) correspond to two CBRs that are horizontally fused, meaning two CBRs that share the same
input tensor and thus gain the advantage of cache reuse on a single pass of the input tensor vs. two separate kernel launches with the same input
tensor. Unsurprisingly, the GIE kernels with four fused layers show some of the largest relative speedups and contribute to ~30% of the overall
speedup. The remainder of the speedup predominantly comes from the two vertically fused CBR layers, which on average have a lower relative
speedup, but comprise the bulk of the computation.
If you are running web or mobile applications that are backed by data center servers, GIE's low overhead means that you can deploy more varied
and complex models to add intelligence to your product that will delight your users. If you are using deep learning to create the next generation of
smart devices, GIE helps you deploy networks with high performance, high accuracy, and high energy efficiency.
Moreover, GIE enables you to leverage the power of GPUs to perform neural network inference using mixed-precision FP16 data. Performing neural
network inference using FP16 can reduce memory usage by half and provide higher performance on Tesla P100 and Jetson TX1 GPUs.
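The memory claim follows directly from the element size: an FP16 value occupies 2 bytes and an FP32 value 4. A small sketch of that arithmetic is shown below. The constant and function names are hypothetical, and `std::uint16_t` is used only as a 2-byte stand-in for a half-precision storage type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Storage sizes per value; uint16_t stands in for a 16-bit half-precision float.
constexpr std::size_t kFp32Bytes = sizeof(float);          // 4 on common platforms
constexpr std::size_t kFp16Bytes = sizeof(std::uint16_t);  // 2

// Bytes needed to store `valueCount` weights or activations.
std::size_t tensorBytes(std::size_t valueCount, std::size_t bytesPerValue) {
    assert(bytesPerValue > 0);
    return valueCount * bytesPerValue;
}
```

For any value count, `tensorBytes(n, kFp16Bytes)` is half of `tensorBytes(n, kFp32Bytes)`, which is where the "reduce memory usage by half" figure comes from.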
GIE is currently being evaluated under an Early Access (EA) Program. To be notified when GIE is ready for public release or if you are interested in
participating in the EA program, please visit the GIE product page (https://developer.nvidia.com/gpuinferenceengine) to contact us today. To
learn more about neural network inference on GPUs, see Michael Andersch's post on inference with GPUs
(https://devblogs.nvidia.com/parallelforall/inferencenextstepgpuaccelerateddeeplearning/).
Allison Gray is a Solutions Architect in the Federal team at NVIDIA. She supports customers using GPUs for deep learning and
geospatial information systems.
Chris Gottbrath is an Accelerated Computing Software Product Manager working to deliver products that help users accomplish
critical missions. Prior to NVIDIA he delivered software development tools to customers in the High Performance and Scientific
computing markets. He led the introduction of CUDA debugging into the popular TotalView debugger used by computational
scientists around the world to create highly scalable parallel codes.
Ryan Olson is a Solutions Architect in the Worldwide Field Organization at NVIDIA. His primary responsibilities involve supporting
deep learning and high performance computing applications.
Shashank Prasanna is a product marketing manager at NVIDIA where he focuses on deep learning products and applications. Prior to
joining NVIDIA, Shashank worked for MathWorks, makers of MATLAB, focusing on machine learning and data analytics, and for
Oracle Corp. designing and developing CRM software. Shashank holds an M.S. in electrical engineering from Arizona State
University.
18 Comments

alexey, a month ago
Hello!
I tried to test your TensorRT samples with my Caffe nets, and I received the following messages.
1) If my net contains an Eltwise Max layer then error:
"cudnnElementWiseLayer.cpp:51: virtual void nvinfer1::cudnn::ElementWiseLayer::execute(const
nvinfer1::cudnn::CommonContext&): Assertion `mParams.operation == ElementWiseOperation::kSUM' failed."
2) If my net contains a TanH layer then error:
"could not parse layer type TanH
Engine could not be created".
Here (https://devblogs.nvidia.com/pa... is written that "GIE supports the following layer types.
Activation: ReLU, tanh and sigmoid
ElementWise: sum, product or max of two tensors"
This is exactly my cases.
Are these two layers not supported yet in the first release indeed? Is it my bugs?
Thank you.

Chris > alexey, a month ago
Thanks very much for reporting this! There is a bug with eltwise and a gap in the parser for tanh. We have bugs filed for
each of these in our tracking system.

Preeti Bindu, 2 months ago
How to benchmark GoogLeNet performance for TensorRT? I have JetPack 2.3 installed on TX1, I can run ImageNet classification
but I cannot find out the images/sec performance.

squidbot, 2 months ago
What operating systems are supported for deployment? Does this only work on Linux? Is Windows 7 or 10 supported as a
deployment OS?

Chris > squidbot, 2 months ago
Linux right now. We are looking at Windows support in a future release.

Ben Graf, 2 months ago
This article mentions that GIE supports networks trained in TensorFlow and other frameworks. However, the release candidate
examples focus specifically on Caffe and it only has a parser for Caffe network definition files. Will there be a parser or some
examples for TensorFlow in the first release?

Chris > Ben Graf, 2 months ago
GIE (TensorRT) has a documented API that you can use to describe a network that you trained using any framework.
Right now it has a parser which makes it especially easy to import a model from Caffe. We will have an example code as
part of the bundle that will show using the API to express a network.

Zi Menglan, 3 months ago
Can someone tell me how to find the document or the use code (example) of TensorRT? It's so wired.

Chris > Zi Menglan, 2 months ago
Sign up for the release candidate testing at developer.nvidia.com/tensorrt and when you get the download bundle it will
contain the header files, a full set of documentation, and three example programs that you can examine, build and run.

Djeb, 3 months ago
Thank you for the great article.
I have just tested out the Listing 1 code by specifying the path to the caffemodel and the path to the prototxt file. Unfortunately I got a
segfault in the CUDA engine build (last line). How can I debug this?

Chris > Djeb, 2 months ago
Djeb, sorry for the delay in responding. Is this still an issue or have you resolved it?

Djeb > Chris, 2 months ago
Hi Chris, yeah, still having this issue. Any thoughts? I have cuDNN 5.1 with a GTX 1070.

Kai Jia, 3 months ago
Will the new fused kernels be available in cuDNN?

The fusion of layers is dynamic, so the fused kernels are generated by TensorRT at optimization time.

Kai Jia > Mark Harris, 3 months ago
Thanks for your rapid reply! nvprof on GIE shows that there are kernels like
`maxwell_scudnn_winograd_128x128_mobile_relu_tile148t_nt` and for normal developers without access to cuDNN
source code, it would be difficult to implement such kernel fusing. The problem is that we do have our own
inference framework that needs to be deployed on all platforms, not only GPU, and completely migrating to
TensorRT is not really feasible. So would it be possible that TensorRT provides a lower-level API, like cuDNN, to
enable users directly calling the fused kernels? Thanks a lot!

We will be releasing fused kernels in a future release of cuDNN.

Kai Jia > Mark Harris, 2 months ago
Really glad to hear that! Thanks very much for your effort :)

Hi Kai. We are working on including the fused kernels in a future release of cuDNN. Thanks!