Tim Dettmers
Making deep learning accessible.

Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning
2017-04-09 by Tim Dettmers (http://timdettmers.com/author/tim-dettmers/)

Deep learning is a field with intense computational requirements and the choice of
your GPU will fundamentally determine your deep learning experience. With no GPU
this might look like months of waiting for an experiment to finish, or running an
experiment for a day or more only to see that the chosen parameters were off. With a
good, solid GPU, one can quickly iterate over deep learning networks, and run
experiments in days instead of months, hours instead of days, minutes instead of hours.
So making the right choice when it comes to buying a GPU is critical. So how do you
select the GPU which is right for you? This blog post will delve into that question and
will lend you advice which will help you to make the choice that is right for you.
TL;DR
Having a fast GPU is a very important aspect when one begins to learn deep learning as
this allows for rapid gain in practical experience which is key to building the expertise
with which you will be able to apply deep learning to new problems. Without this rapid
feedback it just takes too much time to learn from one’s mistakes and it can be
discouraging and frustrating to go on with deep learning. With GPUs I quickly learned
how to apply deep learning on a range of Kaggle competitions and I managed to earn
second place in the Partly Sunny with a Chance of Hashtags Kaggle competition using a

deep learning approach (https://www.kaggle.com/c/crowdflower-weather-twitter/forums/t/6488/congratulations/35640#post35640), where the task was to
predict weather ratings for a given tweet. In the competition I used a rather large two-layered deep neural network with rectified linear units and dropout for regularization,
and this deep net fitted barely into my 6GB GPU memory.

Should I get multiple GPUs?


Excited by what deep learning can do with GPUs I plunged myself into multi-GPU
territory by assembling a small GPU cluster with InfiniBand 40Gbit/s interconnect. I
was thrilled to see whether even better results could be obtained with multiple GPUs.

I quickly found that it is not only very difficult to parallelize neural networks on multiple
GPUs efficiently, but also that the speedup was only mediocre for dense neural
networks. Small neural networks could be parallelized rather efficiently using data
parallelism, but larger neural networks like the one I used in the Partly Sunny with a Chance of
Hashtags Kaggle competition received almost no speedup.

Later I ventured further down the road and I developed a new 8-bit compression
technique (https://arxiv.org/abs/1511.04561) which enables you to parallelize dense or
fully connected layers much more efficiently with model parallelism compared to 32-bit
methods.
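To make the idea concrete, here is a minimal sketch of linear 8-bit quantization in NumPy. This is a simplified stand-in for illustration, not the dynamic encoding scheme from the paper:

```python
# Minimal sketch of 8-bit compression for GPU-to-GPU communication:
# quantize float32 tensors to int8 before sending, dequantize on arrival.
# This cuts communication volume by 4x. NOT the exact scheme from the paper.
import numpy as np

def quantize_8bit(x):
    """Map float32 values to int8 plus a per-tensor scale factor."""
    scale = np.abs(x).max() / 127.0          # simple linear scaling
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_8bit(q, scale):
    """Recover approximate float32 values from the int8 payload."""
    return q.astype(np.float32) * scale

grads = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_8bit(grads)              # this is what would be sent
approx = dequantize_8bit(q, scale)
print("max quantization error:", np.abs(grads - approx).max())
```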

However, I also found that parallelization can be horribly frustrating. I naively
optimized parallel algorithms for a range of problems, only to find that even with
optimized custom code parallelism on multiple GPUs does not work well, given the
effort that you have to put in. You need to be very aware of your hardware and how it
interacts with deep learning algorithms to gauge if you can benefit from parallelization
in the first place.


(https://i1.wp.com/timdettmers.com/wp-content/uploads/2014/08/gpu-pic.jpg)
Setup in my main computer: You can see three GTX Titans and an InfiniBand card. Is this a good setup for
doing deep learning?

Since then, parallelism support for GPUs has become more common, but it is still far from
universally available and efficient. The only deep learning library which currently
implements efficient algorithms across GPUs and across computers is CNTK, which
uses Microsoft's special parallelization algorithms of 1-bit quantization (efficient) and
block momentum (very efficient). With CNTK and a cluster of 96 GPUs you can expect
a near-linear speedup of about 90x-95x. Pytorch might be the next library which supports
efficient parallelism across machines, but the library is not there yet. If you want to
parallelize on one machine then your options are mainly CNTK, Torch, and Pytorch. These
libraries yield good speedups (3.6x-3.8x) and have predefined algorithms for parallelism
on one machine across up to 4 GPUs. There are other libraries which support
parallelism, but these are either slow (like TensorFlow with 2x-3x) or difficult to use for
multiple GPUs (Theano) or both.

If you put value on parallelism I recommend using either Pytorch or CNTK.
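For parallelism on a single machine, a minimal sketch of what this looks like in PyTorch is shown below; nn.DataParallel splits each batch across the visible GPUs and gathers the outputs. The network and sizes are arbitrary placeholders:

```python
# Minimal sketch of single-machine data parallelism in PyTorch:
# nn.DataParallel splits each input batch across the visible GPUs,
# runs the forward pass on each, and gathers the outputs on GPU 0.
import torch
import torch.nn as nn

model = nn.Sequential(                  # placeholder network
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 10),
)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # uses all available GPUs by default
model = model.cuda()

x = torch.randn(256, 1024).cuda()       # the batch of 256 is split across GPUs
out = model(x)                          # shape: (256, 10)
```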

Using Multiple GPUs Without Parallelism


Another advantage of using multiple GPUs, even if you do not parallelize algorithms, is
that you can run multiple algorithms or experiments separately on each GPU. You gain
no speedups, but you get more information about your performance by using different
algorithms or parameters at once. This is highly useful if your main goal is to gain deep
learning experience as quickly as possible, and it is also very useful for researchers who
want to try multiple versions of a new algorithm at the same time.

This is psychologically important if you want to learn deep learning. The shorter the
intervals for performing a task and receiving feedback for that task, the better the brain
is able to integrate relevant memory pieces for that task into a coherent picture. If you
train two convolutional nets on separate GPUs on small datasets you will more quickly
get a feel for what is important to perform well; you will more readily be able to detect
patterns in the cross validation error and interpret them correctly. You will be able to
detect patterns which give you hints about what parameter or layer needs to be added,
removed, or adjusted.

So overall, one can say that one GPU should be sufficient for almost any task but that
multiple GPUs are becoming more and more important to accelerate your deep
learning models. Multiple cheap GPUs are also excellent if you want to learn deep
learning quickly. I personally would rather have many small GPUs than one big one, even for
my research experiments.

So what kind of accelerator should I get? NVIDIA GPU, AMD GPU, or Intel Xeon Phi?
NVIDIA’s standard libraries made it very easy to establish the first deep learning
libraries in CUDA, while there were no such powerful standard libraries for AMD’s
OpenCL. Right now, there are just no good deep learning libraries for AMD cards – so
NVIDIA it is. Even if some OpenCL libraries became available in the future I would
stick with NVIDIA: the GPU computing or GPGPU community is very
large for CUDA and rather small for OpenCL. Thus, in the CUDA community, good
open source solutions and solid advice for your programming are readily available.


Additionally, NVIDIA went all-in with respect to deep learning even though deep
learning was just in its infancy. This bet paid off. While other companies now put money
and effort behind deep learning, they are still far behind due to their late start.
Currently, using any software-hardware combination for deep learning other than
NVIDIA-CUDA will lead to major frustrations.

In the case of Intel’s Xeon Phi it is advertised that you will be able to use standard C
code and transform that code easily into accelerated Xeon Phi code. This feature might
sound quite interesting because you might think that you can rely on the vast
resources of C code. However, in reality only very small portions of C code are
supported, so this feature is not really useful, and most portions of C that you will
be able to run will be slow.

I worked on a Xeon Phi cluster with over 500 Xeon Phis and the frustrations with it were
endless. I could not run my unit tests because the Xeon Phi MKL is not compatible
with Python’s NumPy; I had to refactor large portions of code because the Intel Xeon Phi
compiler is unable to make proper reductions for templates — for example for switch
statements; I had to change my C interface because some C++11 features are just not
supported by the Intel Xeon Phi compiler. All this led to frustrating refactorings which I
had to perform without unit tests. It took ages. It was hell.

And then when my code finally executed, everything ran very slowly. There are bugs(?)
or just problems in the thread scheduler(?) which cripple performance if the tensor
sizes on which you operate change in succession. For example, if you have differently
sized fully connected layers or dropout layers, the Xeon Phi is slower than the CPU. I
replicated this behavior in an isolated matrix-matrix multiplication example and sent it
to Intel. I never heard back from them. So stay away from Xeon Phis if you want to do
deep learning!


Fastest GPU for a given budget

Your first question might be: what is the most important feature for fast GPU
performance for deep learning? Is it CUDA cores? Clock speed? RAM size?

It is none of these; the most important feature for deep learning performance is
memory bandwidth.

In short: GPUs are optimized for memory bandwidth while sacrificing memory
access time (latency). CPUs are designed to do the exact opposite: CPUs can do quick
computations if small amounts of memory are involved, for example multiplying a few
numbers (3*6*9), but for operations on large amounts of memory like matrix
multiplication (A*B*C) they are slow. GPUs excel at problems that involve large
amounts of memory due to their memory bandwidth. Of course there are more
intricate differences between GPUs and CPUs, and if you are interested in why GPUs are
such a good match for deep learning you can read more about it in my Quora answer
(https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning/answer/Tim-
Dettmers-1?srid=JCef) about this very question.

So if you want to buy a fast GPU, first and foremost look at the bandwidth of that GPU.

Evaluating GPUs via Their Memory Bandwidth


 


(https://i2.wp.com/timdettmers.com/wp-
content/uploads/2014/08/memory-bandwidth.png)
Comparison of bandwidth for CPUs and GPUs over time: Bandwidth is one of the main
reasons why GPUs are faster for computing than CPUs are.

Bandwidth can be compared directly within an architecture; for example, the
performance of Pascal cards like the GTX 1080 vs. GTX 1070 can be compared
by looking at memory bandwidth alone: a GTX 1080
(320GB/s) is about 25% (320/256) faster than a GTX 1070 (256GB/s). Across
architectures, however – for example Pascal vs. Maxwell, like GTX 1080 vs. GTX Titan X –
bandwidth cannot be compared directly, because different architectures with different
fabrication processes (in nanometers) utilize the given memory bandwidth differently.
This makes everything a bit tricky, but overall bandwidth alone will give you a good
overview of roughly how fast a GPU is. To determine the fastest GPU for a given
budget one can use this Wikipedia page
(https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series)
and look at bandwidth in GB/s; the listed prices are quite accurate for newer cards
(900 and 1000 series), but older cards are significantly cheaper than the listed prices –
especially if you buy those cards via eBay. For example a regular GTX Titan X goes for
around $550 on eBay.
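As a quick illustration of the within-architecture rule, these ratios can be computed directly from spec-sheet bandwidth numbers (Pascal cards only, since cross-architecture ratios are unreliable):

```python
# Within-architecture comparison via memory bandwidth (Pascal cards only).
# Bandwidth values are from the spec sheets, in GB/s.
bandwidth = {
    "GTX 1050 Ti": 112,
    "GTX 1060":    192,
    "GTX 1070":    256,
    "GTX 1080":    320,
    "GTX 1080 Ti": 484,
}

baseline = bandwidth["GTX 1070"]
for card, bw in bandwidth.items():
    print(f"{card}: {bw / baseline:.2f}x a GTX 1070")   # GTX 1080 -> 1.25x
```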

Another important factor to consider, however, is that not all architectures are
compatible with cuDNN. Since almost all deep learning libraries make use of cuDNN
for convolutional operations, this restricts the choice of GPUs to Kepler GPUs or
better, that is, the GTX 600 series or above. On top of that, Kepler GPUs are generally quite
slow. So this means you should prefer GPUs of the 900 or 1000 series for good
performance.
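As a practical aside, most frameworks let you check whether a card clears this bar; cuDNN requires compute capability 3.0 (Kepler) or higher. In PyTorch the check looks roughly like this:

```python
# Check whether the installed GPU is new enough for cuDNN
# (compute capability 3.0, i.e. Kepler or later).
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), f"- compute capability {major}.{minor}")
    print("cuDNN usable:", (major, minor) >= (3, 0))
else:
    print("No CUDA device found")
```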

To give a rough estimate of how the cards perform with respect to each other on deep
learning tasks I constructed a simple chart of GPU equivalence. How do you read this?
For example, one GTX 980 is as fast as 0.35 Titan X Pascal, or in other terms, a Titan X
Pascal is almost three times faster than a GTX 980.

Please note that I do not have all these cards myself and I did not run deep learning
benchmarks on all of these cards. The comparisons are derived from comparisons of
the cards’ specs together with compute benchmarks (some cases of cryptocurrency
mining are computationally comparable to deep learning). So these are
rough estimates. The real numbers could differ a little, but generally the error should
be minimal and the order of cards should be correct. Also note that small networks
that under-utilize the GPU will make larger GPUs look bad. For example, a small LSTM
(128 hidden units; batch size > 64) on a GTX 1080 Ti will not be that much faster than
running it on a GTX 1070. To get the performance difference shown in the chart one needs
to run larger networks, say an LSTM with 1024 hidden units (and batch size > 64). This is
also important to keep in mind when choosing the GPU which is right for you.
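If you want to see the utilization effect yourself, a rough timing sketch in PyTorch, using the sizes from the example above, might look like this:

```python
# Timing sketch: a small LSTM under-utilizes a fast GPU, a large one does not.
import time
import torch

def time_lstm(hidden, batch=64, seq=100, reps=20):
    lstm = torch.nn.LSTM(hidden, hidden).cuda()
    x = torch.randn(seq, batch, hidden).cuda()
    torch.cuda.synchronize()                 # finish pending GPU work
    start = time.time()
    with torch.no_grad():
        for _ in range(reps):
            lstm(x)
    torch.cuda.synchronize()                 # wait for the timed work
    return (time.time() - start) / reps

print(" 128 hidden units:", time_lstm(128), "s/forward")
print("1024 hidden units:", time_lstm(1024), "s/forward")
```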


(https://i2.wp.com/timdettmers.com/wp-content/uploads/2017/03/performance.jpg)
Rough performance comparisons between GPUs. This comparison is only valid for large workloads.

Cost Efficiency Analysis


If we now plot the rough performance metrics from above and divide them by the costs
for each card – that is, if we plot how much bang you get for your buck – we end up with a plot
which to some degree reflects my recommendations.


(https://i2.wp.com/timdettmers.com/wp-content/uploads/2017/03/cost_efficiency.jpg)
Cost efficiency using rough performance metrics from above, and Amazon prices for new cards and eBay
prices for older cards. Note that this figure is biased in many ways, for example it does not account for
memory.

Note, however, that this measure of ranking GPUs is quite biased. First of all, it does
not take the memory size of the GPU into account. You often will need more memory than
a GTX 1050 Ti can provide, and thus, while cost efficient, some of the high ranking cards
are no practical solution. Similarly, it is more difficult to utilize 4 small GPUs rather than
1 big GPU, and thus small GPUs have a disadvantage. Furthermore, you cannot buy 16
GTX 1050 Ti to get the performance of 4 GTX 1080 Ti; you would also need to buy 3
additional computers, which is expensive. If we take this last point into account the
chart looks like this.


(https://i2.wp.com/timdettmers.com/wp-content/uploads/2017/03/normalized_cost_efficiency.jpg)
Normalized cost efficiency of GPUs which takes into account the price of other hardware. Here we
compare a full machine, that is 4 GPUs, along with high-end hardware (CPU, motherboard etc.) worth
$1500.
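To make the arithmetic behind the two charts explicit, here is a small sketch; the performance values and prices are invented placeholders for illustration, not the numbers used to produce the figures:

```python
# Sketch of the cost-efficiency arithmetic behind both charts.
# All performance and price numbers below are made-up placeholders.
perf = {"GTX 1050 Ti": 0.25, "GTX 1060": 0.40,   # relative to a 1080 Ti
        "GTX 1070": 0.60, "GTX 1080": 0.75, "GTX 1080 Ti": 1.00}
price = {"GTX 1050 Ti": 140, "GTX 1060": 250,    # USD, hypothetical
         "GTX 1070": 400, "GTX 1080": 500, "GTX 1080 Ti": 700}

SYSTEM_COST = 1500    # high-end CPU/motherboard/PSU for a 4-GPU box
GPUS_PER_BOX = 4

for card in perf:
    raw = perf[card] / price[card]                          # first chart
    normalized = perf[card] / (price[card] + SYSTEM_COST / GPUS_PER_BOX)
    print(f"{card}: raw {raw:.5f}, normalized {normalized:.5f} perf per $")
```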

So in this case, which practically represents the case where you want to buy many GPUs,
the big GPUs unsurprisingly win, since it is cheaper to buy cost efficient
computer + GPU combinations (rather than merely cost efficient GPUs). However, this
is still biased for GPU selection. It does not matter how cost efficient 4 GTX 1080 Ti in
a box are if you have a limited amount of money and cannot afford it in the first place.
So you might not be interested in how cost efficient cards are but actually, for the
amount of money that you have, what is the best performing system that you can buy?
You also have to deal with other questions such as: How long will I have this GPU for?
Do I want to upgrade GPUs or the whole computer in a few years? Do I want to sell the
current GPUs some time in the future and buy new, better ones?

So you can see that it is not easy to make the right choice. However, if you take a
balanced view on all of these issues, you would come to conclusions which are similar to
the following recommendations.


General GPU Recommendations


Generally, I would recommend the GTX 1080 Ti, GTX 1080 or GTX 1070. They are all
excellent cards and if you have the money for a GTX 1080 Ti you should go ahead with
that. The GTX 1070 is a bit cheaper and still faster than a regular GTX Titan X
(Maxwell). The GTX 1080 was a bit less cost efficient than the GTX 1070, but since the
GTX 1080 Ti was introduced the price fell significantly and now the GTX 1080 is able
to compete with the GTX 1070. All three cards should be preferred over the GTX
980 Ti due to their increased memory of 11GB and 8GB (instead of 6GB).

A memory of 8GB might seem a bit small, but for many tasks this is more than
sufficient. For example, for Kaggle competitions, most image datasets, deep style and
natural language understanding tasks you will encounter few problems.

The GTX 1060 is the best entry GPU for when you want to try deep learning for the
first time, or if you want to occasionally use it for Kaggle competitions. I would not
recommend the GTX 1060 variant with 3GB of memory, since the other variant’s 6GB of
memory can be quite limiting already. However, for many applications the 6GB is
sufficient. The GTX 1060 is slower than a regular Titan X, but it is comparable in both
performance and eBay price to the GTX 980.

In terms of bang for buck, the 10 series is quite well designed. The GTX 1050 Ti, GTX
1060, GTX 1070, GTX 1080 and GTX 1080 Ti stand out. The GTX 1060 and GTX 1050
Ti are for beginners, the GTX 1070 and GTX 1080 are versatile options for startups and
some parts of research and industry, and the GTX 1080 Ti stands solid as an all-around
high-end option.

I generally would not recommend the NVIDIA Titan Xp as it is too pricey for its
performance. Go instead with a GTX 1080 Ti. However, the NVIDIA Titan Xp still has
its place among computer vision researchers who work on large datasets or video
data. In these domains every GB of memory counts and the NVIDIA Titan Xp just has
1GB more than the GTX 1080 Ti and thus an advantage in this case. I would not
recommend the NVIDIA Titan X (Pascal) anymore, since the NVIDIA Titan Xp is faster
and almost the same price. Due to the scarcity of these GPUs on the market, however, if
you cannot find a NVIDIA Titan Xp that you can buy, you could also go for a Titan X
(Pascal). You might also be able to snatch a cheap Titan X (Pascal) from eBay.

If you already have GTX Titan X (Maxwell) GPUs an upgrade to NVIDIA Titan X (Pascal)
or NVIDIA Titan Xp is not worth it. Save your money for the next generation of GPUs.

If you are short of money but you know that 12GB of memory is important for you,
then the GTX Titan X (Maxwell) from eBay is also an excellent option.

However, most researchers do well with a GTX 1080 Ti. The one extra GB of memory is
not needed for most research and most applications.

I personally would go with multiple GTX 1070 or GTX 1080 for research. I would rather run a
few more experiments which are a bit slower than run just one experiment which is
faster. In NLP the memory constraints are not as tight as in computer vision, so a
GTX 1070/GTX 1080 is just fine for me. The tasks I work on and how I run my
experiments determine the best choice for me, which is either a GTX 1070 or GTX
1080.

You should reason in a similar fashion when you choose your GPU. Think about what
tasks you work on and how you run your experiments and then try to find a GPU which
suits these requirements.

The options are now more limited for people who have very little money for a GPU.
GPU instances on Amazon Web Services are quite expensive and slow now and no
longer pose a good option if you have less money. I do not recommend a GTX 970 as it
is slow, still rather expensive even if bought in used condition ($150 on eBay), and there
are memory problems associated with the card to boot. Instead, try to get the
additional money to buy a GTX 1060, which is faster, has a larger memory and has no
memory problems. If you just cannot afford a GTX 1060 I would go with a GTX 1050 Ti
with 4GB of RAM. The 4GB can be limiting but you will be able to play around with
deep learning and, if you make some adjustments to models, you can get good
performance. A GTX 1050 Ti would be suitable for most Kaggle competitions, although
it might limit your competitiveness in some competitions.

The GTX 1050 Ti in general is also a solid option if you just want to try deep learning
for a bit without any serious commitments.

Amazon Web Services (AWS) GPU instances


In the previous version of this blog post I recommended AWS GPU spot instances, but I
would no longer recommend this option. The GPUs on AWS are now rather slow (one
GTX 1080 is four times faster than an AWS GPU) and prices have shot up dramatically in
recent months. It now again seems much more sensible to buy your own GPU.

Conclusion
With all the information in this article you should be able to reason about which GPU to
choose by balancing the required memory size, bandwidth in GB/s for speed, and the
price of the GPU, and this reasoning will be solid for many years to come. But right now
my recommendation is to get a GTX 1080 Ti, GTX 1070, or GTX 1080 if you can afford
them; a GTX 1060 if you are just starting out with deep learning or are constrained by
money; if you have very little money, try to afford a GTX 1050 Ti; and if you are a
computer vision researcher you might want to get a Titan Xp.

TL;DR advice
Best GPU overall (by a small margin): Titan Xp
Cost efficient but expensive: GTX 1080 Ti, GTX 1070, GTX 1080
Cost efficient and cheap: GTX 1060 (6GB)


I work with data sets > 250GB: GTX Titan X (Maxwell), NVIDIA Titan X Pascal, or
NVIDIA Titan Xp
I have little money: GTX 1060 (6GB)
I have almost no money: GTX 1050 Ti (4GB)
I do Kaggle: GTX 1060 (6GB) for any “normal” competition, or GTX 1080 Ti for “deep
learning competitions”
I am a competitive computer vision researcher: NVIDIA Titan Xp; do not upgrade
from existing Titan X (Pascal or Maxwell)
I am a researcher: GTX 1080 Ti. In some cases, like natural language processing, a GTX
1070 or GTX 1080 might also be a solid choice — check the memory requirements of
your current models
I want to build a GPU cluster: This is really complicated, you can get some ideas here
(http://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-
deep-learning/)
I started deep learning and I am serious about it: Start with a GTX 1060 (6GB).
Depending on what area you choose next (startup, Kaggle, research, applied deep
learning) sell your GTX 1060 and buy something more appropriate
I want to try deep learning, but I am not serious about it: GTX 1050 Ti (4 or 2GB)

Update 2017-04-09: Added cost efficiency analysis; updated recommendation with NVIDIA Titan Xp
Update 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
Update 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
Update 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series


Update 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the
comparison relation
Update 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
Update 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
Update 2015-02-23: Updated GPU recommendations and memory calculations
Update 2014-09-28: Added emphasis for memory requirement of CNNs

Acknowledgements

I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX
970; I want to thank Sander Dieleman for making me aware of the shortcomings of my
GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for
pointing out software dependency problems for the GTX 580; and I want to thank
Oliver Griesel for pointing out notebook solutions for AWS instances.

[Image source: NVIDIA CUDA/C Programming Guide


(http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz3AI18t18Z)]


Comments

Trace says
2014-09-28 at 10:53

How much slower are mid-level GPUs? For example, I have a Mac with a GeForce
750M; is it suitable for training DNN models?


timdettmers says


2014-09-28 at 11:33

There is a GT 750M version with DDR3 memory and GDDR5 memory; the
GDDR5 memory will be about thrice as fast as the DDR3 version. With a
GDDR5 model you probably will run three to four times slower than typical
desktop GPUs but you should see a good speedup of 5-8x over a desktop CPU

as well. So a GDDR5 750M will be sufficient for running most deep learning
models. If you have the DDR3 version, then it might be too slow for deep
learning (smaller models might take a day; larger models a week or so).



Skydeep says


2017-04-23 at 13:18

Thanks a lot, Mr. Tim D.

You have a very lucid approach to answering complicated stuff; I hope you
could point out what impact floating point 32 vs 16 makes on speedup, and
how a 1080 Ti stacks up against the Quadro GP100?


Tim Dettmers says


2017-04-29 at 13:56

A P100 chip, be it the P100 itself or the GP100, should be roughly 10-
30% faster than a Titan Xp. I do not know of any hard, unbiased data
on half-precision, but I think you could expect a speedup of about 75-
100% on P100 cards compared to cards with no FP16 support, such
as the Titan Xp.
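For readers who want to try half precision themselves, below is a minimal PyTorch sketch; note that FP16 only pays off on cards with native FP16 arithmetic such as the P100, and the sizes are arbitrary placeholders:

```python
# Sketch: running a network in half precision (FP16) in PyTorch.
# Only worthwhile on GPUs with native FP16 arithmetic (e.g. a P100).
import torch

model = torch.nn.Linear(4096, 4096).cuda().half()   # FP16 weights
x = torch.randn(64, 4096).cuda().half()             # FP16 inputs
y = model(x)
print(y.dtype)                                      # torch.float16
```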



Sriram says
2017-05-04 at 15:56

You don’t need a really powerful GPU for inference. Intel’s on-board graphics is
more than enough for getting real-time performance in most
applications (unless it is a high frame rate VR experience). For training, you
obviously need an NVIDIA card, but it is a one-time thing.


Lewis Cowles (@LewisCowles1) says


2014-10-05 at 09:04

Is it any good for processing non-mathematical data or non-floating point data via GPU?
How about the handling of generating hashes and keypairs?


timdettmers says


2014-10-05 at 09:56

Sometimes it is good, but often it isn’t – it depends on the use-case. One
application of GPUs for hash generation is bitcoin mining. However the main
measure of success in bitcoin mining (and cryptocurrency mining in general) is
to generate as many hashes as possible per watt of energy; GPUs are in the midfield here,
beating CPUs but being beaten by FPGAs and other low-energy hardware.
In the case of keypair generation, e.g. in mapreduce, you often do little
computation, but lots of IO operations, so that GPUs cannot be utilized
efficiently. For many applications GPUs are significantly faster in one case, but
not in another similar case, e.g. for some but not all regular expressions, and
this is the main problem why GPUs are not used in other cases.


James Dang (@JamesDanged) says


2014-10-19 at 11:23

Hi, nice writeup! Are you using single or double precision floats? You said divide by
4 for the byte size, which sounds like 32 bit floats, but then you point out that the
Fermi cards are better than Kepler, which is more true when talking about double
precision than single, as the Fermi cards have FP64 at 1/8th of FP32 while Kepler is
at 1/24th. Trying to decide myself whether to go with the cheaper GeForce cards or to
spring for a Titan.



timdettmers says


2014-10-19 at 11:32

Thanks for your comment, James. Yes, deep learning is generally done with single
precision computation, as the gains in precision do not improve the results
greatly.

It depends on what types of neural network you want to train and how large they
are. But I think a good decision would be to go for a 3GB GTX 580 from eBay,
and then upgrade to a GTX 1000 series card next year. The GTX 1000 series
cards will probably be quite good for deep learning, so waiting for them might
be a wise choice.


enedene says


2014-12-28 at 20:51

Thank you for the great post. Could you say something about pairing a new card
with an older CPU?

For example, I have a 4-core Intel Q6600 from 2007 with 8GB of RAM (without the
possibility to upgrade). Could this be a bottleneck if I choose to buy a new GPU for
CUDA and ML?

I’m also not sure which one is the better choice: a GTX 780 with 2GB of RAM vs. a GTX 970
with 4GB of RAM. The 780 has more cores, but they are a bit slower…

http://www.game-debate.com/gpu/index.php?gid=2438&gid2=880&compare=geforce-gtx-970-4gb-vs-geforce-gtx-780

A nice list of characteristics; still, I’m not sure which would be the better choice. I
would use the GPU for all kinds of problems, perhaps some with smaller networks,
but I wouldn’t be shy of trying something bigger when I feel comfortable enough.

What would you recommend?


timdettmers says


2014-12-29 at 13:33

Hi enedene, thanks for your question!

Your CPU should be sufficient and should slow you down only slightly (1-10%).

My post is now a bit outdated as the new Maxwell GPUs have been released.
The Maxwell architecture is much better than the Kepler architecture and so
the GTX 970 is faster than the GTX 780 even though it has lower bandwidth.
So I would recommend getting a GTX 970 over a GTX 780 (of course, a GTX
980 would be better still, but a GTX 970 will be fine for most things, even for
larger nets).

For low budgets I would still recommend a GTX 580 from eBay.

I will update my post next week to reflect the new information.


enedene says


2014-12-29 at 14:52

Thank you for the quick reply. I will most probably get a GTX 970. Looking
forward to your updated post, and to competing against you on Kaggle.


Anatoly says
2014-12-30 at 18:31

Hi Tim. What open-source package would you recommend if the objective was to
classify non-image data? Most packages are specifically designed for classifying
images.


timdettmers says


2015-01-01 at 19:52

I have only superficial experience with most libraries, as I usually used my
own implementations (which I adjusted from problem to problem). However,
from what I know, Torch7 is really strong for non-image data, but you will
need to learn some Lua to adjust some things here and there. I think pylearn2 is
also a good candidate for non-image data, but if you are not used to Theano
then you will need some time to learn how to use it in the first place. Libraries
like deepnet – which is programmed on top of cudamat – are much easier to use
for non-image data, but the available algorithms are partially outdated and
some algorithms are not available at all.

I think you always have to change a few things in order to make it work for new
data, and so you might also want to check out libraries like Caffe and see if you
like the API better than other libraries. A neater API might outweigh the costs
of needing to change stuff to make it work in the first place. So the best advice
might be just to look at documentation and examples, try a few libraries, and
then settle for something you like and can work with.



Monica says
2015-01-16 at 18:41

Hi Tim. Do you have any references that explain why convolutional kernels need
more memory beyond that used by the network parameters? I am trying to figure
out why Alex’s net needs just over 3.5GB when the parameters alone only take
~0.4GB… what’s hogging the rest?!?


timdettmers says


2015-01-16 at 20:30

Thanks for your comment, Monica. This is indeed something I overlooked,
and it is actually a quite important issue when selecting a GPU. I hope to
address this in an update I aim to write soon.

To answer your question: The increased memory usage stems from memory that
is allocated during the computation of the convolutions to increase
computational efficiency: Because image patches overlap, one saves a lot of
computation by saving some of the image values to then reuse them for
an overlapping image patch. Albeit at a cost of device memory, one can achieve
tremendous increases in computational efficiency if one does this cleverly, as
Alex does in his CUDA kernels. Other solutions that use fast Fourier
transforms (FFTs) are said to be even faster than Alex’s implementation, but
these do need even more memory.
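To make the overhead concrete, here is a rough back-of-the-envelope estimate of this im2col-style expansion for a single convolutional layer; the layer dimensions are illustrative and this is a simplified model of the memory/speed trade-off:

```python
# Rough im2col memory estimate for one conv layer: every kxk input patch
# is unrolled into a column, so activations can grow by roughly k*k.
def im2col_bytes(batch, channels, height, width, k, stride, dtype_bytes=4):
    out_h = (height - k) // stride + 1      # output rows (no padding)
    out_w = (width - k) // stride + 1       # output columns
    cols = batch * out_h * out_w            # one column per patch position
    rows = channels * k * k                 # unrolled patch values
    return rows * cols * dtype_bytes

# AlexNet-like first layer: 128-image batch, 3x224x224 input, 11x11, stride 4
gb = im2col_bytes(128, 3, 224, 224, 11, 4) / 1024**3
print(f"~{gb:.2f} GB for the unrolled patches of this one layer")
```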

If you are aiming to train large convolutional nets, then a good option might be
to get a normal GTX Titan from eBay. If you use convolutional nets heavily, two,
or even four GTX 980 (much faster than a Titan) also make sense if you plan to
use the convnet2 library which supports dual GPU training. However, be aware
that NVIDIA might soon release a Maxwell GTX Titan equivalent which would
be much better than the GTX 980 for this application.


Mike says
2015-01-20 at 05:06

Hi Tim. Thanks for this very informative post.

Do you know how much of a boost Maxwell gives? I’m trying to decide between a
GTX 850M with 4GB DDR3 and a Quadro K1100M with 2GB GDDR5. I
understand that the K1100M is roughly equivalent to the 750M. Which gives the
bigger boost: going from Kepler to Maxwell, or from GeForce to Quadro (including
from DDR3 to GDDR5)?

Thanks so much!


timdettmers says


2015-01-20 at 07:26

Going from DDR3 to GDDR5 is a larger boost than going from Kepler to
Maxwell. However, the Quadro K1100M has only slightly higher bandwidth
than the GTX 850M, which will probably cancel out the benefits, so that both
cards will perform at about the same level. If you want to use convolutional
neural networks the 4GB memory on the GTX 850M might make the difference;
otherwise I would go with the cheaper option.


Mike says
2015-01-20 at 17:14

Thanks!


ragv says
2015-02-04 at 14:00

Hi, I am planning to replicate the ImageNet object identification problem using CNNs, as
published in the recent paper by G. Hinton et al. (just as an exercise to learn about
deep learning and CNNs).

1. What GPU would you recommend considering I am a student? I heard the original
paper used 2 GTX 580 and yet took a week to train the 7-layer deep network; is this
true? Could the same be done using a single GTX 580 or GTX 970? How much time
would it take to train the same on a GTX 970 or a single GTX 580? (A week of time is
okay for me.)

2. What kind of modifications to the original implementation could I make (like 5 or 6
hidden layers instead of 7, or a smaller number of objects to detect, etc.) to make this
little project of mine easier to implement on a lower budget while at the same time
helping me learn about deep nets and CNNs?

3. What kind of libraries would you recommend for the same? Torch7 or pylearn2/Theano?
(I am fairly proficient in Python but not so much in Lua.)

4. Is there a small-scale implementation of this anywhere, on GitHub etc.?

Also thanks a lot for the wonderful post.


timdettmers says


2015-02-04 at 15:37

1. All GPUs with 4GB should be able to run the network; you can run somewhat
smaller networks on one GTX 580; these networks will always take more than
5 days to train, even on the fastest GPUs.
2. Read about convolutional neural networks; then you will understand what
the layers do and how you can use them. This is a good, thorough tutorial:
http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
3. I would try pylearn2, convnet2, and caffe and pick whichever suits you best.
4. The implementations are generally general implementations, i.e. you run
small and large networks with the same code; it is only a difference in the
parameters to a function. If by “small” you mean a less complex API, I heard
good things about the Lasagne library.


Jack says

2015-02-04 at 18:36

Hi Tim, super interesting article. What case did you use for the build that
had the GPUs vertical?


timdettmers says


2015-02-04 at 18:44

It looks like it is vertical, but it is not. I took that picture while my
computer was lying on the ground. However, I use the Cooler Master
HAF X for both of my computers. I bought this tower because it has a
dedicated large fan for the GPU slot – in retrospect I am unsure if the
fan is helping that much. There is another tower I saw that actually has
vertical slots, but again I am unsure if that helps so much. I would
probably opt for liquid cooling for my next system. It is more difficult
to maintain, but has much better performance. With liquid cooling
almost any case would do that fits the mainboard and GPUs.


Jack says
2015-02-04 at 19:33

It looks like there is a bracket supporting the end of the cards, did
that come with the case or did you put them in to support the
cards?


gwern says


2015-02-23 at 20:08

(Duplicate paragraph: “I quickly found”.)


timdettmers says


2015-02-23 at 20:37

Thanks, fixed.


vikasing says

2015-02-25 at 10:16

Great article!

You did not talk about the number of cores present in a graphics card (CUDA cores
in the case of NVIDIA). My perception was that a card with more cores will always be
better, because more cores will lead to better parallelism and hence faster
training, given that the memory is enough. Please correct me if my
understanding is wrong.

Which card would you suggest for RNNs and a data size of 15-20GB
(Wikipedia/Freebase size)? Would a 960 be good enough? Or should I go with a
970? The 580 is not available in my country.


timdettmers says


2015-02-25 at 11:50

Thanks for your comment. CUDA cores relate more closely to FLOPS and not
to bandwidth, but it is the bandwidth that you want for deep learning. So CUDA
cores are a bad proxy for performance in deep learning. What you really want is
a high memory bus width (e.g. 384 bits) and a high memory clock (e.g. 7000MHz)
– anything other than that hardly matters for deep learning.
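The bandwidth implied by these two numbers follows from a simple formula; a back-of-the-envelope sketch using the example figures above:

```python
# Memory bandwidth = (bus width in bytes) x (effective memory clock).
# Example: 384-bit bus at a 7000MHz effective data rate (GTX Titan X class).
bus_bits = 384
effective_clock_hz = 7_000_000_000

bandwidth_bytes_per_s = (bus_bits / 8) * effective_clock_hz
print(bandwidth_bytes_per_s / 1e9, "GB/s")   # -> 336.0 GB/s
```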

Mat Kelcey did some tests with Theano on the GTX 970 and it seems that the
GPU has no memory problems for compute – so the GTX 970 might be a good
choice then.



vikasing says


2015-02-25 at 12:35

Thanks a lot


yakup says
2015-02-27 at 09:51

Hi Tim,

Thanks for your excellent blog posts. I am a statistician and I want to go into the deep
learning area. I have a budget of $1500-2000. Can you recommend a good
desktop system for deep learning purposes? From your blog post I know that I will
get a GTX 980, but what about the CPU, RAM, and motherboard requirements?

Thanks


timdettmers says


2015-02-27 at 10:49

Hi Yakup,
I wanted to write a blog post with detailed advice about this topic sometime in
the next two weeks, and if you can wait for that you might get some insights into
what hardware is right for you. But I also want to give you some general, less
specific advice.

If you might be getting more GPUs in the future, it is better to buy a
motherboard with PCIe 3.0 and 7 PCIe x16 slots (one GPU typically takes two
slots). If you will use only 1-2 GPUs, then almost any motherboard will do
(PCIe 2.0 would also be okay). Plan to get a power supply unit (PSU) with
enough watts to power all the GPUs you will get in the future (e.g. if you will
get a maximum of 4, then buy a 1400+ watt PSU). The CPU does not need to
be fast or have many cores. Twice as many threads as you have GPUs is almost
always sufficient (for Intel CPUs we mostly have: 1 core = 2 threads); any CPU
with more than 3GHz is okay; less than 3GHz might give you a tiny penalty in
speed of about 1-3%. Fast memory caches are often more important for CPUs,
but in the big picture they contribute little to overall performance; a typical
CPU with slow memory will decrease the overall performance by a few percent.

One can work around a small RAM by loading data sequentially from your hard
drive into your RAM, but it is often more convenient to have a larger RAM; two
times the RAM your GPU has gives you more freedom and flexibility (i.e. 8GB
RAM for a GTX 980). An SSD will make it more comfortable to work, but
similarly to the CPU it offers little performance gain (0-2%; depends on the
software implementation); an SSD is nice if you need to preprocess large
amounts of data and save them in smaller batches, e.g. preprocessing 200GB
of data and saving it in batches of 2GB is a situation in which SSDs can save
a lot of time. If you decide to get an SSD, a good rule might be to buy one that is
twice as large as your largest data set. If you get an SSD, you should also get a
large hard drive to which you can move old data sets.
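As a hedged illustration of that preprocess-and-batch workflow, here is a small sketch; the file names, array shape, and preprocessing step are all hypothetical:

```python
# Sketch of the preprocess-once-and-save-small-batches workflow.
# File names, array shape, and the preprocessing step are hypothetical.
import numpy as np

def preprocess(chunk):
    # placeholder preprocessing: standardize each feature column
    return (chunk - chunk.mean(axis=0)) / (chunk.std(axis=0) + 1e-8)

# large raw file on the SSD, read lazily via memory mapping
data = np.memmap("raw_data.f32", dtype=np.float32, mode="r",
                 shape=(200_000_000, 128))
rows_per_batch = 2_000_000                # ~1GB of float32 per batch

for i, start in enumerate(range(0, data.shape[0], rows_per_batch)):
    batch = preprocess(np.asarray(data[start:start + rows_per_batch]))
    np.save(f"batch_{i:04d}.npy", batch)  # small files load quickly later
```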

So the bottom line is: a $1000 system should perform at least at 95% of the level of a
$2000 system, but a $2000 system offers more convenience and might save
some time for preprocessing.


Dewan says
2015-02-27 at 15:52

Hi Tim,

Nice and very informative post. I have a question regarding the processor.

Would you suggest building a computer with an AMD processor (for example,
an AMD FX-8350 4.0GHz 8-core processor) over an Intel-based processor
for deep learning? I also do not know whether the AMD processor has PCIe 3.0 support.
Could you please give your thoughts on this?

And thanks a lot for the wonderful post.


timdettmers says


2015-02-27 at 16:26

Thanks for your comment, Dewan. An AMD CPU is just as good as an
Intel CPU; in fact I might favor AMD over Intel CPUs because Intel
CPUs pack just too much unnecessary punch – one simply does not
need so much processing power as all the computation is done by the
GPU. The CPU is only used to transfer data to the GPU and to start
kernels (which is little more than a function call). Transferring data
means that the CPU should have a high memory clock and a memory
controller with many channels. This is often not advertised on CPUs
as it is not so relevant for ordinary computation, but you want to choose
the CPU with the larger memory bandwidth (memory clock times
memory controller channels). The clock on the processor itself is less
relevant here.

A 4GHz 8-core AMD CPU might be a bit of an overkill. You could definitely
settle for less without any degradation in performance. But what you
say about PCIe 3.0 support is important (some new Haswell CPUs do
not support 40 lanes, but only 32; I think most AMD CPUs support all
40 lanes). As I wrote above, I will write a more detailed analysis in a
week or two.



yakup says
2015-02-28 at 09:50

Hey Tim,

Thanks for the excellent detailed post. I look forward to reading your other posts.
Keep going


elanmart says


2015-02-28 at 14:05

Hey! Thanks for the great post!

I have one question, however:

I’m in the “started DL, serious about it” group and have a decent PC already,
although without an NVIDIA GPU. I’m also a 1st-year student, so a GTX 980 is out of
the question. The question is: what do you think about Amazon EC2? I could easily
buy a GTX 580, but I’m not sure if it’s the best way to spend my money. And when I
think about more expensive cards (like the 980 or the ones to be released in 2016) it
seems like running a spot instance for 10 cents per hour is a much better choice.

What could be the main drawbacks of doing DL on EC2 instead of my own
hardware?


timdettmers says


2015-03-01 at 08:58

I think an Amazon Web Services (AWS) EC2 instance might be a great choice for
you. AWS is great if you want to use a single GPU or multiple separate GPUs (one
GPU for one deep net). However, you cannot use them for multi-GPU
computation (multiple GPUs for one deep net) as the virtualization cripples the
PCIe bandwidth; there are rather complicated hacks that improve the
bandwidth, but it is still bad. Everything beyond two GPUs will not work on
AWS because their interconnect is way too slow for that.


Gideon says

2015-04-17 at 14:13

Is the AWS single GPU limitation relevant to the new g2.8xlarge instance?
(see https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/)


Tim Dettmers says


2015-04-17 at 14:24

It seems to run the same GPUs as those in the g2.2xlarge which would
still impede parallelization for neural networks, but I do not know for
sure without some hard numbers. I bet that with custom patches 4
GPU parallelism is viable although still slow (probably one GTX Titan X
will be faster than the 4 GPUs on the instance). More than 4 GPUs still
will not work due to the poor interconnect.


fayzur20 says
2015-03-02 at 11:17 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-60)

Thanks for the explanation. Looking forward to reading the other post.


Soumyajit Ganguly says


2015-03-07 at 10:37 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-62)

Hi Tim,
I am a bit confused between buying your recommended GTX 580 and a new GTX
750 (maxwell). The models which I am getting in ebay are around 120 USD but they
are 1.5GB models. One big problem with the 580 would be, buying a new PSU
(500 watt). As you stated, the Maxwell architecture is the best, so would the GTX
750 (512 CUDA cores, 1GB DDR5) be a good choice? It will be about 95 USD and I
can also do without an expensive PSU.
My research area is mainly in text mining and nlp, not much of images. Other than
this I would do Kaggle competitions.


Tim Dettmers (https://timdettmers.wordpress.com) says

2015-03-08 at 10:42 (http://timdettmers.com/2017/04/09/which-gpu-for-


deep-learning/#comment-63)

A GTX 750 will be a bit slower than a GTX 580, which should be fine and more
cost effective in your case. However, maybe you want to opt for the 2 GB
version; with 1 GB it will be difficult to run convolutional nets; 2 GB will also be
limiting of course, but you could use it on most Kaggle competitions I think.


vinhnguyenx (http://vinhnguyenx.wordpress.com) says


2015-03-11 at 12:58 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-65)

great posts, Tim!


Which deep learning framework do you often use for your work, may I ask?


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-03-17 at 21:00 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-68)

I programmed my own library for my work on the parallelization of deep
learning; otherwise I use Torch7, with which I am much more productive than
with Caffe or Pylearn2/Theano.


Ashkan says
2017-05-22 at 21:55 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-15227)

What about TensorFlow?


Tim Dettmers says


2017-05-26 at 13:24 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-15351)

The comment above is quite outdated now. TensorFlow is good. I personally
favor PyTorch. I believe one can be much more productive with
PyTorch — at least I am.


InternetJ says


2015-03-17 at 19:57 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-66)

I know that these are not recommended, but 580 won’t work for me because of the
lack of Torch 7 support: will the 660 or 660 Ti work with Torch 7? Is this possible to
check before purchasing? Thank you!


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-03-17 at 20:52 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-67)

The cuDNN component of Torch 7 needs a GPU with compute capability 3.5. A
660 or 660 Ti will not work; you can find out which GPUs have which compute
capability on NVIDIA's CUDA GPUs page (https://developer.nvidia.com/cuda-gpus).


Timothy Scharf (https://plus.google.com/110325506655575835983) says

2015-03-23 at 14:14 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-69)

Any comments on this new Maxwell architecture Titan X? $1000 US


http://www.pcworld.com/article/2898093/nvidia-fully-reveals-1000-titan-x-the-
most-advanced-gpu-ever.html (http://www.pcworld.com/article/2898093/nvidia-
fully-reveals-1000-titan-x-the-most-advanced-gpu-ever.html)

seemingly has a massive memory bandwidth bump – for example the gtx 980 specs
claim 224 GB/sec with the Maxwell architecture, this has 336 GB/sec (and also
comes stock with 12GB VRAM!)

Along that line, are the memory bandwidth specs not apples-to-apples comparisons
across different Nvidia architectures?

i.e. the 780 Ti also claims 336 GB/sec with the Kepler architecture – but you claim
the 980 with 224 GB/sec bandwidth can out-benchmark it for basic neural net
activities?

Appreciate this post


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-03-23 at 15:26 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-70)

You can compare bandwidth within a microarchitecture (Maxwell: GTX Titan X
vs GTX 980, or Kepler: GTX 680 vs GTX 780), but across architectures you
cannot do that (Maxwell card X vs Kepler card X). Very minute changes in
the design of a microchip can make a vast difference in bandwidth, FLOPS, or
FLOPS/watt.

Kepler was about FLOPS/watt and double precision performance for scientific
computing (engineering, simulation etc.), but the complex design led to poor
utilization of the bandwidth (memory bus times memory clock). With Maxwell
the NVIDIA engineers developed an architecture which has both energy
efficiency and good bandwidth utilization, but the double precision suffered in
turn — you just cannot have everything. Thus Maxwell cards make great gaming and
deep learning cards, but poor cards for scientific computing.

The GTX Titan X is so fast because it has a very large memory bus width (384
bit), an efficient architecture (Maxwell) and a high memory clock rate (7 GHz) —
and all this in one piece of hardware.


adalyac (http://gravatar.com/adalyac) says


2015-03-26 at 13:49 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-71)

“a 6GB GPU is plenty for now” — don’t you get severely limited in the batch size
(like, 30 max) for 10^8+ parameter convnets (e.g. Simonyan's very deep nets, GoogLeNet)?

although I think some DL toolkits are starting to come with functionality of updating
weights after >1 batch load/unload onto gpu, which I guess would result in
theoretically unlimited batch size, though not sure how this would impact speed?


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-03-26 at 14:27 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-72)

This is a good point, Alex. I think you can also get very good results with conv
nets that feature less memory intensive architectures, but the field of deep
learning is moving so fast that 6 GB might soon be insufficient. Right now, I
think one still has quite a bit of freedom with 6 GB of memory.

A batch and activation unload/load procedure would be limited by the ~8GB/s
bandwidth between GPU and CPU, so there will definitely be a decrease in
performance if you unload/load a majority of the needed activation values.
Because the bandwidth bottlenecks are very similar to parallelism, one can
expect a decrease in performance of about 10-30% if you unload/load the
whole net. So this would be an acceptable procedure for very large conv nets;
however, smaller nets with fewer parameters would still be more practical I think.

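A rough illustration of where an estimate like 10-30% comes from; only the ~8 GB/s figure is from the comment above, the other numbers are hypothetical:

activations_gb = 4.0  # assumed activation data swapped out per iteration
pcie_gb_s = 8.0       # host<->device bandwidth quoted above
compute_s = 1.5       # assumed pure GPU compute time per iteration

transfer_s = activations_gb / pcie_gb_s
share = transfer_s / (compute_s + transfer_s)
print(transfer_s, share)  # 0.5s transfer -> 25% of the total time on PCIe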

Mario B. says
2015-04-01 at 23:56 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-73)

What is your opinion about the different brands (EVGA, ASUS, MSI, GIGABYTE) of
the video card for the same model?
Thanks for this post Tim, it is very illustrative.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-04-04 at 07:27 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-74)

EVGA cards often have many extra features (dual BIOS, extra fan design) and a
bit higher clock and/or memory, but their cards are more expensive too.
However, with respect to price/performance it often differs from card to card
which is the best one, and one cannot draw general conclusions from a brand.
Overall, the fan design is often more important than the clock rates and extra
features. The best way to determine the best brand is often to look for
references on how hot one card runs compared to another, and then think about
whether the price difference justifies the extra money.

Most often though, one brand will be just as good as the next and the performance
gains will be negligible — so going for the cheapest brand is a good strategy in
most cases.


Timothy Scharf (https://plus.google.com/110325506655575835983) says


2015-04-08 at 01:35 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-75)

hey Tim,

you've been a big help – I have included the results from the CUDA bandwidth test (which
is included in the samples file of the basic CUDA install).

This is for a GTX 980 running on 64bit linux with i3770 CPU, and PCIe 2.0 lanes on
motherboard.

Does this look reasonable?


Are they indicative of anything?

Are the device/host and host/device speeds typically the bottleneck you speak of?

no reply necessary – just learning

thanks again

tim@ssd-tim ~/NVIDIA_CUDA-7.0_Samples/bin/x86_64/linux/release $
./bandwidthTest
[CUDA Bandwidth Test] – Starting…
Running on…

Device 0: GeForce GTX 980


Quick Mode

Host to Device Bandwidth, 1 Device(s)


PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12280.8

Device to Host Bandwidth, 1 Device(s)


PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12027.4

Device to Device Bandwidth, 1 Device(s)


PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 154402.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results
may vary when GPU Boost is enabled.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-04-08 at 06:28 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-76)

Looks quite reasonable; the bandwidth from host to device, and device to host,
is limited by either RAM or PCIe 2.0, and 12GB/s is faster than expected;
150GB/s is slower than the 224GB/s which the GTX 980 is capable of, but this
is due to the small transfer size of 32MB — so this looks fine.


johno says
2015-04-21 at 14:13 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-79)

Hi Tim, great post! I feel lucky that I chose a 580 a couple of years ago when I
started experimenting with neural nets. If there had been an article like this then I
wouldn’t have been so nervous!

I’m wondering if you have any quick tips for fresh Ubuntu installs with current
nvidia cards? When I got my used system running a couple of years ago it took quite
a while and I fought with drivers, my 580 wasn't recognized, etc. On the table next
to me is a brand new build that I just put together that I'm hoping will streamline my
ML work. It's an Intel X99 system with a Titan X (I bought into the hype!). Windows
went on fine (although I will rarely use it) and Ubuntu will go on shortly. I'm not
looking forward to wrestling with drivers…so any tips would be greatly appreciated.
If you have a cheat-sheet or want to do a post, I’m sure it will be warmly welcomed
by many…especially me!


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-04-21 at 15:01 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-80)

Yeah, I also had my troubles with installing the latest drivers on Ubuntu, but
soon I got the hang of it. You want to do this:
0. Download the driver and remember the path where you saved the file
1. Purge the system of the nvidia and nouveau drivers
2. Blacklist the nouveau driver
3. Reboot
4. Ctrl + Alt + F1
5. sudo service lightdm stop
6. chmod +x driver_file
7. sudo ./driver_file

And you should be done. Sometimes I had trouble with stopping lightdm; you
have two options:
1. try sudo /etc/init.d/lightdm stop
2. kill all lightdm processes (sudo killall lightdm or (1) ps ux | grep lightdm, (2)
find the process id, (3) sudo kill -9 id)

For me the second option worked.

You can find more details on the first steps here:


http://askubuntu.com/questions/451221/ubuntu-14-04-install-nvidia-driver
(http://askubuntu.com/questions/451221/ubuntu-14-04-install-nvidia-driver)


johno says
2015-04-24 at 19:59 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-87)


Thanks for the reply Tim. I was able to get it all up and running pretty
painlessly. Hopefully your response helps somebody else too…it’s nice to
have this sort of information in one spot if it’s GPU+DNN related.

On a performance note, my new system with the Titan X is more than 10
times faster on an MNIST training run than my other computer (i5-2500k
+ GTX 580 3GB). And for fun, I cranked up the mini-batch size on a Caffe
example (flickr finetuning) and got 41% validation accuracy in under an
hour.

I believe you hit a nerve with a couple of your blog posts…I think the type
of information that you’re giving is quite valuable, especially to folks who
haven’t done much of this stuff.

One possible information portal could be a wiki where people can outline
how they set up various environments (theano, caffe, torch, etc.) and the
associated dependencies. Myself, I set up a few and I’m left with a few
questions like for example…
-given all the dependencies, which should be built versus apt-get versus
pip? A holistic outlook would be a very educational thing. I found myself
building the base libraries and using the setup method for many python
packages but after a while there were so many I started using apt-get and
pip and adding things to my paths…blah blah…at the end everything works
but I admit I lost track of all the details.

I know that I’m not alone!! Having a wiki resource that I could contribute
to during the process would be good for me and for others doing the same
thing…instead of hunting down disparate sources and answering questions
on stackoverflow.

I mention this because you probably already have a ton of traffic because
of a couple key posts that you have. Put a wiki up and I promise I’ll
contribute! I’ll consider doing it myself as well…time…need more time!

thanks again.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-04-27 at 12:31 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-89)

Thanks, johno, I am glad that you found my blog posts and comments
useful. A wiki is a great idea and I am looking into that. Maybe when I
move this site to a private host this will be easy to set up. Right now I
do not have time for that, but I will probably migrate my blog in two
months or so.


Timothy Scharf says


2015-04-23 at 04:04 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-81)


I am a bit of a novice but got it done in a few hours.

My thoughts

Try and start with a clean install of an NVIDIA-supported Linux distro (Ubuntu
14.04 LTS).

I used the Linux distro proprietary drivers, instead of downloading them from
NVIDIA. The X-org-edgers PPA has them and they keep them pretty current. This
means you can install the actual NVIDIA driver via sudo apt-get, and also (more
importantly) easily upgrade the driver in a few months when NVIDIA releases a
new version. It also blacklists Nouveau automatically. You can toggle between
driver versions in the software manager as it shows you all the drivers you have.

Once you have the driver working, you are most of the way there. I ran into a
few troubles with the CUDA install, as sometimes your computer may have
some libraries missing, or conflicts. But I got CUDA 7.0 going pretty quickly

these two links helped

http://bikulov.org/blog/2015/02/28/install-cuda-6-dot-5-on-clean-ubuntu-14-
dot-04/ (http://bikulov.org/blog/2015/02/28/install-cuda-6-dot-5-on-clean-
ubuntu-14-dot-04/)

http://developer.download.nvidia.com/compute/cuda/6_0/rel/docs/CUDA_Getting_Started_Linux.pdf
(http://developer.download.nvidia.com/compute/cuda/6_0/rel/docs/CUDA_Getting_Started_Linux.pdf)

There is gonna be some trial and error; be ready to reinstall Ubuntu and take
another try at it.

good luck


Jack says
2015-04-23 at 21:54 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-82)

Hi Tim-
Does the platform you plan on DLing on matter? By this I mean X99, Z97, AM3+, etc.
X99 is able to utilize more threads and cores than Z97, but I'm not sure if that helps
at all, similar to cryptocurrency mining, where hardware besides the GPU doesn't
matter.


Tim Dettmers (https://timdettmers.wordpress.com) says

2015-04-24 at 04:07 (http://timdettmers.com/2017/04/09/which-gpu-for-


deep-learning/#comment-83)

Hi Jack-
Please have a look at my full hardware guide
(https://timdettmers.wordpress.com/2015/03/09/deep-learning-hardware-
guide/) for details, but in short, hardware besides the GPU does not matter
much (although a bit more than in cryptocurrency mining).


Jack says
2015-04-24 at 04:24 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-84)

Ok, sure, thanks.


yakup says
2015-04-24 at 07:00 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-85)

Hi Tim,
I have benefited from this excellent post. I have a question regarding Amazon GPU
instances. Can you give a rough estimate of the performance of the Amazon GPUs? Like
GTX TITAN X = ? Amazon GPUs.

Thanks,


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-04-24 at 11:16 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-86)

Thanks, this was a good point, I added it to the blog post. The new AWS GPUs
(g2.2 and g2.8 large) are about as fast as a GTX 680 (they are based on the
same chip, but are slightly different to support virtualization). However, there
are still some performance decreases due to virtualization for the memory
transfer from CPU to GPU and between GPUs; this is hard to measure and
should have little impact if you use just one GPU. If you perform multi-GPU
computing the performance will degrade harshly.


Dimiter (http://gravatar.com/dmilush) says

2015-04-26 at 21:36 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-88)

Hi Tim,
Thanks for sharing all this info.
I don’t understand the difference between GTX 980 from say Asus and Nvidia.
Obviously same architecture, but are they much different at all?
Why does it seem hard to find NVIDIA products in Europe?
Thanks


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-04-27 at 12:46 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-90)

So this is how a GPU is produced and comes into your hands:
1. NVIDIA designs a circuit for a GPU
2. It makes a contract with a semiconductor producer (currently TSMC in
Taiwan)
3. The semiconductor producer produces the GPU and sends it to NVIDIA
4. NVIDIA sends the GPU to companies such as ASUS, EVGA etc.
5. ASUS, EVGA, etc. modify the GPU (clock speeds, fan — nothing fundamental,
the chip stays the same)
6. You buy the GPU from either 5. or 4.

So while all GPUs are from NVIDIA you might buy a branded GPU from, say,
ASUS. This GPU is the very same GPU as another GPU from, say, EVGA. Both
GPUs run the very same chip. So essentially, all GPUs are the same (for a given
chip).

Some GPUs are not available in other countries because of regulations
(NVIDIA might have no license, but other brands have?) and because it might
not be profitable to sell them there in the first place (you will need a
different set of logistics for international trade; NVIDIA might not have the
expertise and infrastructure for this, but regular hardware companies like
ASUS, EVGA do).


salemameen says
2015-04-30 at 20:51 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-91)

Hi Tim,
Thank you for your advice, I found it very, very useful. I have many questions,
please feel free to answer some of them. I have many choices to buy a powerful
laptop or computer; my budget is £4000.00.
I would like to buy a Mac Pro (costs nearly £3400.00), so can I apply deep learning on
this machine, as it uses the OSX operating system, and I want to use torch7 in my
implementation? Second, I will buy a Titan X; then I have two choices: first, I install
the TITAN X GPU in the Mac Pro; second, I buy an Alienware Amplifier (to use the
TITAN X) with an Alienware 13 laptop. Could you please tell me if this is possible
and easy to do, because I am not a computer engineer, but I want to use deep
learning in my research.
Best regards,
Salem


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-05-01 at 04:56 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-92)

I googled the Alienware Amplifier and I read it only has 4 GB/s of internal
bandwidth, and it might be that there are other problems. If you use a single GPU,
this is not too much of a concern, but be prepared to deal with performance
decreases in the range of 5-25%. If there are technical details that I overlooked,
the performance decrease might be much higher — you will need to look into
that yourself.

The GTX Titan X in a Mac Pro will do just fine I guess. While most deep learning
libraries will work well with OSX there might be a few problems here and there,
but I think torch7 will work fine.

However, consider also that you will pay a heavy price for the aesthetics of
Apple products. You could buy a normal high-end computer with 2 GTX Titan X
and it will still be cheaper than a Mac Pro. Ubuntu or any other Linux-based OS
needs some time to get comfortable with, but it works just as well as OSX and
often makes programming easier than OSX does. So it is basically all down to
aesthetics vs performance — that's your call!


salemameen says
2015-05-01 at 12:29 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-94)

Is it easy to install a GTX Titan X in a Mac Pro? Does it need external
hardware or a power supply, or does it just plug in?


salemameen says
2015-05-01 at 12:25 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-93)

Many thanks Tim


salemameen says

2015-05-02 at 01:51 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-95)

Is it easy to install GTX Titan X in a Mac Pro? Does it need external hardware or
power supply or just plug in?


vasconl (http://gravatar.com/vascovnl) says


2015-05-19 at 23:10 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-98)

Hi,

Nice article! You recommended all high-end cards. What about mid-range cards for
those with a really tight budget? For example, the GT 740 line has a model with 4GB
GDDR5, a 5000 MT/s memory clock, a 128-bit bus width, and is rated at ~750 GFLOPS. Will
such a card likely give a nice boost in neural net training (assuming the net fits in the card's
memory) over a mid-range quad-core CPU?

Thanks!


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-05-20 at 13:59 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-99)
The GT 740 with 4GB GDDR5 is a very good choice for a low budget. Maybe I
should even include that option in my post for a very low budget. A GT 740 will
definitely be faster than quad-core CPUs (probably anything between 3 to 7
times faster, depending on the CPU and the problem).


Jay says
2015-05-30 at 10:31 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-100)

Thanks for this great article. What do you think of the upcoming GTX 980 Ti? I have
read it has 6GB and clock speed/cores closer to the Titan X. Rumoured to be $650-
750. I was about to buy a new PC, but thought I might hold out as it’s coming in
June.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-05-30 at 10:41 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-101)

The GTX 980 Ti seems to be great. 6GB of RAM is sufficient for most tasks
(unless you use super large data sets, do video classification, or use
expensive convolutional architectures) and the speed is about the same. If you
use Nervana Systems' 16-bit kernels (which will be integrated into torch7) then
there should be no issues with memory even with these expensive tasks.

So the GTX 980 Ti seems to be the new best choice in terms of cost
effectiveness.


Kumar says
2015-06-07 at 11:15 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-102)

Hi,
I am a novice at deep nets and would like to start with some very small convolutional
nets. I was thinking of using a GTX 750 Ti (in my part of the world it is not really very
cheap for a student). I would convince my advisor to get a more expensive card after
I am able to show some results. Will it be sufficient to do a meaningful
convolutional net using Theano?


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-06-07 at 12:01 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-103)

Your best choice in this situation will be to use an Amazon Web Services GPU
spot instance. These instances have small costs ($0.1 an hour or so) and you
will be able to produce results quickly and cheaply, after which your advisor might
be willing to buy an expensive GPU. To save more costs, it would be best to
prototype your solution on a CPU (just test that the code is working correctly)
and then start up an AWS GPU instance and let your code run for a few
hours/days. This should be the best solution.

I think there are predefined AWS images which you can load, so that you do not
have to install anything — google "AMI AWS + theano" or "AMI AWS + torch" to
find more.


Kumar says
2015-06-08 at 10:44 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-104)

Thanks a lot for the suggestion. I will go ahead and try this.


Jay says

2015-06-17 at 23:49 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-105)

Will the Pascal GPUs have any special requirements, such as X99 or DDR4? I am
currently planning a Z97 build with DDR3, but don't want to be stuck in a year's
time! Thanks, J


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-06-18 at 06:14 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-106)

According to the information that is available


(http://www.kitguru.net/components/graphic-cards/anton-shilov/nvidia-
pascal-architectures-nvlink-to-enable-8-way-multi-gpu-capability/), Pascal will
not need X99 or DDR4 (which would be quite limiting for sales), instead Pascal
cards will just be like a normal card stuck in a PCIe slot with NVLink on top (just
like SLI) and thus no new hardware is needed.


Jay says
2015-06-19 at 07:36 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-107)

Sweet, thanks.


mmm says
2015-06-24 at 10:07 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-108)

About this:

“GTX Titan X = 0.66 GTX 980 = 0.6 GTX 970 = 0.5 GTX Titan = 0.40 GTX 580

GTX Titan X = 0.35 GTX 680 = 0.35 AWS GPU instance (g2.2 and g2.8) = 0.33 GTX
960”

Have you actually measured the times/used these GPUs, or are you “guessing”?

Thank you for the article!


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-06-24 at 10:24 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-109)

Very good question!

Because deep learning is bandwidth-bound, the performance of a GPU is
determined by its bandwidth. However, this is only true for GPUs with the
same architecture (Maxwell, Kepler, Fermi). So for example: the comparisons
between GTX Titan X and GTX 980 should be quite accurate.

Comparisons across architectures are more difficult and I cannot assess them
objectively (because I do not have all the GPUs listed). To provide a relatively
accurate measure I sought out information where a direct comparison was
made across architectures. Some of these are opinion- or “feeling”-based, other
sources of information are not relevant (game performance measures), but
there are some sources of information which are relatively objective
(performance measures for bandwidth-bound cryptocurrency mining); so I
weighted each piece of information according to its relevance and then I
rounded everything to neat numbers for comparisons between architectures.

So all in all, these measures are quite opinionated and do not rely on good
evidence. But I think I can make more accurate estimates than people who do
not know GPUs well. Therefore I think it is the right thing to include this
somewhat inaccurate information here.


Dmitry says
2015-10-23 at 16:28 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-711)

Hi,

Thanks a lot for the updated comparison. I bought a 780 Ti a year ago and
I'm curious how it compares to the newer cards. I use it for NLP tasks
mainly, including RNNs, starting with LSTMs.


Also, do I get it right that ‘GTX Titan X = 0.66 GTX 980’ means that 980 is
actually 2/3 as fast as Titan X or the other way round?


Tim Dettmers says


2015-10-26 at 22:04 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-756)

A GTX 780 Ti is pretty much the same as a GTX Titan Black in terms of
performance (slower than a GTX 980). Exactly, the 980 is about 2/3
the speed of a Titan X.

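That 2/3 figure is consistent with the bandwidth-bound reasoning above, since both cards are Maxwell; a quick check with the bandwidth numbers quoted earlier in the thread:

print(224 / 336)  # GTX 980 vs GTX Titan X bandwidth -> 0.666..., i.e. ~2/3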

need some says


2015-07-05 at 21:35 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-110)

Can you comment on this note on the cuda-convnet page


https://code.google.com/p/cuda-convnet/wiki/Compiling
(https://code.google.com/p/cuda-convnet/wiki/Compiling)
?


“Note: A Fermi-generation GPU (GTX 4xx, GTX 5xx, or Tesla equivalent) is required
to run this code. Older GPUs won’t work. Newer (Kepler) GPUs also will work, but
as the GTX 680 is a terrible, terrible GPU for non-gaming purposes, I would not
recommend that you use it. (It will be slow). ”

I am probably in the “started DL, serious about it”-group, and would have probably
bought the GTX 680 after reading your (great) article.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-07-08 at 07:28 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-113)

This is very much true. The performance of the GTX 680 is just bad. But
because the Fermi GPUs (4xx and 5xx) are not compatible with the NVIDIA
cuDNN library which is used by many deep learning frameworks, I do not
recommend the GTX 5xx. The GTX 7xx series is much faster, but also much
more expensive than a GTX 680 (except the GTX 960, which is about as fast as
the GTX 680), so the GTX 680, despite being so slow, is the only viable choice
(besides the GTX 960) for a very low budget.

As you can see in the comment of zeecrux, the GTX 960 might actually be
better than the GTX 680 by quite a margin. So it is probably better to get a GTX
960 if you find a cheap one. If this is too expensive, settle for a GTX 580.


need some says


2015-07-09 at 08:16 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-114)

Ok, thank you! I can’t see any comment by zeecrux ? How bad is the
performance of the GTX 960 ? Is it suf cient to have if you mainly want to
get started with DL, play around with it, do the occasional kaggle comp, or
is it not even worth spending the money in this case ? Buying a Titan X or
GTX 980 is quite an investment for a beginner ?


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-07-09 at 09:28 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-115)

Ah, I did not realize, the comment of zeecrux was on my other blog
post, the full hardware guide. Here is the comment:

“ImageNet on K40:
Training is 19.2 secs / 20 iterations (5,120 images) – with cuDNN

and GTX 770:
cuDNN Training: 24.3 secs / 20 iterations (5,120 images)
(source: http://caffe.berkeleyvision.org/performance_hardware.html
(http://caffe.berkeleyvision.org/performance_hardware.html))

I trained an ImageNet model on a GTX 960 and have this result:
Training is around 26 secs / 20 iterations (5,120 images) – with cuDNN”

A K40 is about as fast as a GTX Titan. So the GTX 960 is definitely
faster and better than a GTX 680. It should be sufficient for most
kaggle competitions and is a perfect card to get started with deep
learning.

So it makes good sense to buy a GTX 960 and wait for Pascal to arrive
in Q3/Q4 2016, instead of buying a GTX 980 Ti or GTX 980 now.

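For easier comparison, the quoted timings can be converted into throughput: 20 iterations process 5,120 images, so throughput is 5120 divided by the measured seconds (a small sketch using only the numbers quoted above):

for card, secs in [("K40", 19.2), ("GTX 770", 24.3), ("GTX 960", 26.0)]:
    print(card, round(5120 / secs), "images/s")
# K40: 267 images/s, GTX 770: 211 images/s, GTX 960: 197 images/s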

tommy says
2017-06-29 at 15:06 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-16583)

Hey Tim,
Can I know where to check this statement? “But because the Fermi GPUs
(4xx and 5xx) are not compatible with the NVIDIA cuDNN library “. TIA.


Tim Dettmers says


2017-07-05 at 00:40 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-16817)

Check this stackoverflow answer
(https://stackoverflow.com/questions/37461988/does-cudnn-library-
works-with-all-nvidia-graphic-cards) for a full answer and source to
that question.


Haider says
2015-07-07 at 03:50 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-111)

Hi Tim,

Do you think it is better to buy a Titan X now or wait for the new Pascal, if I want to
invest in just one GPU within the coming 4 years?


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-07-08 at 06:54 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-112)

The Pascal architecture should be a quite large upgrade when compared to
Maxwell. However, you have to wait more than a year for the cards to arrive. If your
current GPU is okay, I would wait. If you have no GPU at all, you can use AWS
GPU instances, or buy a GTX 970 and sell it after one year to buy a Pascal card.


serige (http://gravatar.com/serige) says


2015-07-16 at 18:22 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-116)

From what I read, GPU Direct RDMA is only available for workstation cards
(Quadro/Tesla). But it seems like you are able to get your cluster to work with a few
GTX Titans and IB cards here. Not sure what I am missing.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-07-16 at 18:36 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-117)

You will need a Mellanox InfiniBand card. For me a ConnectX-2 worked, but
usually only ConnectX-3 and ConnectX-IB are supported. I never tested GPU
Direct RDMA with Maxwell, so it might not work there.

To get it working on Kepler devices, you will need the patch you find under
downloads here (nvidia_peer_memory-1.0-0.tar.gz):
http://www.mellanox.com/page/products_dyn?product_family=116
(http://www.mellanox.com/page/products_dyn?product_family=116)

Even with that I needed quite some time to configure everything, so prepare
yourself for a long read of documentation and many Google searches for errors.


Joe Hoover says

2015-08-01 at 04:55 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-120)

Hi Tim, thank you for posting and updating this, I've found it very helpful.

I do have a general question, though, about Quadro cards, which I've noticed neither
you nor many others discuss using for deep learning. I'm configuring a new machine
and, due to some administrative constraints, it is easiest to go with a Quadro K5000.

I had specced out a different machine with a GTX 980, but it's looking like it will be
harder to purchase it. My questions are whether there is anything I should be aware
of regarding using Quadro cards for deep learning and whether you might be able to
ballpark the performance difference. We will probably be running moderately sized
experiments and are comfortable losing some speed for the sake of convenience;
however, if there would be a major difference between the 980 and K5000, then we
might need to reconsider. I know it is difficult to make comparisons across
architectures, but any wisdom that you might be able to share would be greatly
appreciated.

Thanks!


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-08-01 at 05:09 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-121)

The K5000 is based on a Kepler chip and has 173 GB/s memory bandwidth.
Thus it should be a bit slower than a GTX 680.


Falak says
2017-01-09 at 15:44 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-11379)

Hi!
I am in a similar situation. No comparison of Quadro and GeForce is available
anywhere. Just curious, which one did you end up buying and how did it work
out?


Vu Pham (https://www.facebook.com/app_scoped_user_id/1691380981083432/)
says
2015-08-03 at 04:25 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-122)

Hi Tim, first I want to say that I'm truly extremely impressed with your blog, it's very
helpful.

Talking about the bandwidth of PCIe, have you ever heard about PLX Tech with their
PEX 8747 bridge chip? Anandtech has a good review on how it works and its
effect on gaming: http://www.anandtech.com/show/6170/four-multigpu-z77-
boards-from-280350-plx-pex-8747-featuring-gigabyte-asrock-ecs-and-evga
(http://www.anandtech.com/show/6170/four-multigpu-z77-boards-from-280350-
plx-pex-8747-featuring-gigabyte-asrock-ecs-and-evga). They even said that it can
also replicate 4 x16 lanes on a CPU which has 28 lanes.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-08-03 at 05:55 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-123)

Someone mentioned it before in the comments, but that was another
mainboard with 48x PCIe 3.0 lanes; now that you say you can operate with 16x
on all four GPUs I got curious and looked at the details.

It turns out that this chip switches the data in a clever way, so that a GPU will
have full bandwidth when it needs high speed. However, when all GPUs need
high speed bandwidth, the chip is still limited by the 40 PCIe lanes that are
available at the physical level. When we transfer data in deep learning we need
to synchronize gradients (data parallelism) or output (model parallelism) across
all GPUs to achieve meaningful parallelism; as such, this chip will provide no
speedups for deep learning, because all GPUs have to transfer at the same
time.

Transferring the data one after the other is most often not feasible, because we
need to complete a full iteration of stochastic gradient descent in order to work
on the next iteration. Delaying updates would be an option, but one would
suffer losses in accuracy and the updates would not be that efficient anymore
(4 delayed updates = 2-3 real updates?). This would make this approach rather
useless.
useless.

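A back-of-the-envelope sketch of why simultaneous transfers are the problem; the ~1 GB/s per PCIe 3.0 lane is a rough rule of thumb and the gradient size is hypothetical:

lanes_total = 40      # physical PCIe 3.0 lanes available from the CPU
gb_s_per_lane = 1.0   # rough PCIe 3.0 throughput per lane
gradients_gb = 0.5    # hypothetical gradient size per GPU and iteration
n_gpus = 4

# All GPUs synchronize at the same time, so however cleverly the PLX chip
# multiplexes x16 links, the transfers share the same physical lanes:
aggregate_bw = lanes_total * gb_s_per_lane
sync_time_s = n_gpus * gradients_gb / aggregate_bw
print(sync_time_s)  # 0.05 -> ~50 ms per gradient synchronization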

Vu Pham
(https://www.facebook.com/app_scoped_user_id/1691380981083432/)
says
2015-08-04 at 13:02 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-124)

Thanks for your detailed explanation.


Alvas (http://alvations.com) says


2015-08-14 at 00:35 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-126)

Is it possible to use the GTX 960M for Deep Learning?


http://www.geforce.com/hardware/notebook-gpus/geforce-gtx-
960m/specifications (http://www.geforce.com/hardware/notebook-gpus/geforce-
gtx-960m/specifications). It has 2.5GB GDDR though. Maybe a pre-built spec with
http://t.co/FTmEDrJDwb (http://t.co/FTmEDrJDwb)?


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-08-16 at 09:32 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-127)

A GTX 960M will be comparable in performance to a GTX 950. So you should
see a good speedup using this GPU, but it will not be a huge speedup compared
to other GPUs. However, compared to laptop CPUs the speedup will still be
considerable. To do more serious deep learning work on a laptop you need
more memory and preferably faster computation; a GTX 970M or GTX 980M
should be very good for this.


naveen DN (https://plus.google.com/112747792732213642656) says


2015-09-09 at 08:15 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-130)

Hi Tim
I’m planning to build a pc mainly for kaggle and getting started with deep learning.
This is my rst time.For my budget I’m thinking of going with

i7-4790k
GTX 960 4GB
Gigabyte GA-Z97X-UD3H-BK or Asus Z97-A 32GB DDR3 Intel Motherboard


I’m wishing to replace the gtx 960 or add another card later on …

Is this a good build? Please offer your suggestions.

Thanks in advance:)


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-09-09 at 10:33 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-131)

Looks like a solid cheap build with one GPU. The build will suffice for a Pascal
card once it becomes available and thus should last about 4 years with a Pascal
upgrade. The GTX 960 is a good choice to try things out and use deep learning
on kaggle. You will not be able to build the best models, but models that are
competitive with the top 10% in deep learning kaggle competitions. Once you
get the hang of it, you can upgrade and you will be able to run the models that
usually win those kaggle competitions.


Vu Pham (https://www.facebook.com/app_scoped_user_id/1691380981083432/)
says
2015-09-13 at 15:18 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-519)

Hi Tim,

Right now i’m in between 2 choices: 2 gtx 690 and a Titanx. Both come with same
price. Which one do you think is better for conv net? Or Multimodal Recurrent
Neural Net


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-09-13 at 15:35 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-520)

I would definitely pick a GTX Titan X over two GTX 690s, mainly because using
two GTX 690s for parallelism is difficult and will be slower than a single Titan X.
Running multiple algorithms (different algorithms on each GPU) on the two
GTX 690s will be good, but a Titan X comes close to this due to its higher
processing speed.


Bjarke says
2015-09-16 at 14:29 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-521)

Are there any important differences between the GTX 980 and the GTX 980 Ti? It
seems that we can only get the latter. While it seems faster, I’m not skilled enough in
the area to know whether it has any issues related to using it for deep learning.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-09-21 at 07:10 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-524)

The GTX 980 Ti is as fast as the GTX Titan X (50% faster than a GTX 980), but
has 6GB of memory instead of 12GB. There are no issues with the card; it
should work flawlessly.


Dong Ta says
2015-09-16 at 22:54 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-522)

What do you think of Titan X superclocked vs. regular Titan X? Are the up/down
sides noticeable?


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-09-21 at 07:11 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-525)

The upgrade should be unnoticeable (0-5% increased speed) and I would
recommend a superclocked version only if you do not pay any additional money
for that.


Sergei Wallace says


2016-01-05 at 09:28 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-1367)

Possibly (probably) a dumb question, but can you use a superclocked GPU
with a non-superclocked GPU? Reason I ask is that a cheap used
superclocked Titan Black is for sale on ebay as well as another cheap Titan
Black (non-superclocked). Just want to make sure I wouldn't be making
some mistake by buying the second one if I decided to get two Titan Black
GPUs.

p.s. thanks for the blog. Super helpful for all of us noobies.


Tim Dettmers says


2016-01-09 at 10:33 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-1418)

Yes, this will work without any problem. I myself have been using three different kinds of GTX Titan for many months. In deep learning the difference in compute clock hardly matters, so the GPUs will not diverge during parallel computation. So there should be no problems.


Mattias Johansson says


2015-09-21 at 06:23 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-523)

Hello Tim

Thank you very much for your in-depth hardware analysis (both this and the other one you did). I basically ended up buying a new computer based only on your ideas.

I chose the GTX 960, and I might upgrade next year if I feel this is something for me.

But in a lot of places I read about this ImageNet DB. The problem there seems to be that I need to be a researcher (or in education) to download the data. Do you know anything about this? Is there any way for me as a private person (doing this for fun) to download the data? The reason why I want this dataset is that it is huge, and it also would be fun to be able to compare how my nets work compared to other people’s.

If not, what other image databases besides CIFAR and MNIST do you recommend?

Thanks again.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-09-22 at 08:57 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-527)

Hello Mattias, I am afraid there is no way around the educational email address for downloading the dataset. It really is a shame, but if these images were exploited commercially then the whole system of free datasets would break down, so it is mainly due to legal reasons.

There are other good image datasets like the Google Street View house numbers (http://ufldl.stanford.edu/housenumbers/) dataset; you can also work with Kaggle datasets that feature images, which has the advantage that you get immediate feedback on how well you are doing, and the forums are excellent for reading up on how the best competitors achieved their results.


Mattias Johansson says


2015-09-23 at 06:01 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-528)

Thanks for the quick reply,

I will look into both Kaggle and the Street View dataset then.


Michael Holm says


2015-09-23 at 21:05 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-529)

Hello Tim,

Thank you for your article. I understand that researchers need a good GPU for training a top-performing (convolutional) neural network. Can you share any thoughts on what compute power is required (or what is typically desired) for transfer learning (i.e. fine-tuning of an existing model) and for model deployment?

Thank you!


Tony says
2015-09-24 at 14:12 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-530)

Tim, such a great article. I’m going back and forth between the Titan Z and the Titan X. I can probably buy the Titan Z for ~$500 from my friend. I’m very confused as to how much memory it actually has. I see that it has 6GB x 2.

I guess my question is: does the Titan Z have the same specs as the Titan X in terms of memory? How does this work from a deep learning perspective (currently using Theano)?

Many Thanks,


Tony says
2015-09-24 at 14:13 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-531)

One thing I should add is that I’m building RNNs (specifically LSTMs) with this Titan Z or Titan X. I’m also considering the 980 Ti.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-09-24 at 15:15 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-532)

Please have a look at my answer on Quora (https://www.quora.com/Nvidia-GeForce-GTX-TITAN-X-vs-TITAN-Z-Which-one-is-better-for-deep-learning-and-computer-vision-applications), which deals exactly with this topic. Basically, I recommend going for the GTX Titan X. However, $500 for a GTX Titan Z is also a good deal. Memory-wise, you can think of the GTX Titan Z as two normal GTX Titans with a connection between the two GPUs, so two GPUs with 6GB of memory each.


Tony says
2015-09-25 at 18:31 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-535)

That makes much more sense. Thanks again, I checked out your response on Quora. You’ve really changed my views on how to set up deep learning systems. Can’t even begin to express how thankful I am.


Tony says
2015-10-06 at 13:48 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-537)

Hey Tim, not to bother you too much. I bought a 980 Ti, and things have been great. However, I was just doing some searching and saw that the AMD Radeon R9 390X is ~$400 on Newegg and has 8GB memory and 500GB/s bandwidth. These specs are roughly 30% better than the 980 Ti for $650.

I was wondering what your thoughts are on this? Is the AMD compute architecture slower than the Nvidia Kepler architecture for deep learning? In the next month or so, I’m considering purchasing another card.

Based upon the numbers, it seems that the AMD cards are much cheaper compared to Nvidia. I was hoping you could comment on this!


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-10-06 at 14:40 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-538)

Theoretically the AMD card should be faster, but the problem is the software: since no good software exists for AMD cards, you would have to write most of the code yourself with an AMD card. Even if you manage to implement good convolutions, the AMD card will likely perform worse than the NVIDIA one, because the NVIDIA convolutional kernels have been optimized by a few dozen researchers for more than 3 years.

NVIDIA Pascal cards will have up to 750-1000 GB/s memory bandwidth, so it is worth waiting for Pascal, which will probably be released in about a year.


Tony says
2015-10-09 at 19:00 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-578)

Yeah, I can’t wait for Pascal. For now I will just rock out with the 980 Tis. Thanks a lot!


Nghia says
2015-10-27 at 16:02 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-767)

Hi Tim,

Coming across this blog on deep learning is great for a newbie like me.
I have two choices in hand now: one GTX 980 4GB, or two GTX 780 Ti 3GB in SLI. Which one do you recommend for my deep learning research box?
I am more in favour of the two 780 Tis, going by your writing on CUDA cores + memory bandwidth.

Thank you very much.


Nghia


Tim Dettmers says


2015-10-27 at 17:27 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-769)

I would favor the GTX 980, which will be much faster than two GTX 780 Tis even if you use the two cards in parallel. However, the two GTX 780 Tis will be much better if you run independent algorithms, and thus enable you to learn how to train deep learning algorithms successfully more quickly. On the other hand, the 3GB on them is rather limiting and will prevent you from training current state-of-the-art convolutional networks. If you want to train convolutional networks I would suggest you choose the GTX 980 rather than the two GTX 780 Tis due to this.


Nghia says

2015-10-29 at 16:54 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-794)

Thank you very much for the advice.


Is it possible to put all three cards into one machine, and would that give me a good enough environment to learn parallel programming and study deep learning with neural networks (Torch7 & Lua)?

Would a system with those three cards (2x 780 Ti + 1x 980) yield better performance overall, or would the hardware disparity and complexity drag it down?


Tim Dettmers says


2015-10-30 at 09:34 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-802)

Yes, you could run all three cards in one machine. However, you can only select one type of GPU for your graphics, and for parallelism only the two 780s will work together. There might be problems with the driver though, and it might be that you need to select your Maxwell card (980) to be your graphics output.

In a three-card system you could tinker with parallelism with the 780s and switch to the 980 if you are short on memory. If you run NervanaGPU you could also use 16-bit floating point models, thus doubling your memory; however, NervanaGPU will not work on your Kepler 780 cards.
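To make such a split concrete: a process can be restricted to specific cards before the framework initializes CUDA. A minimal sketch, assuming the two 780s are CUDA devices 0 and 1 on this machine (the device ids and the Theano import are illustrative, not prescribed):

import os

# Hypothetical sketch: expose only the two GTX 780s (assumed to be CUDA
# devices 0 and 1 here) to the training process, leaving the GTX 980 free
# for graphics or a separate job. This must run before the deep learning
# framework initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import theano  # or tensorflow/keras, imported only after the variable is set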


Nghia Tran says


2015-10-30 at 12:21
(http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-804)

Thank you very much, Tim.


For the sake of study, from the specs:

+ The GTX 780 Ti: 2880 CUDA cores + 3GB (384-bit memory bus), and double that with SLI.
+ The GTX 980: 2048 CUDA cores + 4GB (256-bit memory bus).

Does the 1GB VRAM difference make a big deal in deep learning?

I will benchmark and post the results once I get my hands on a system with the above two configurations.

Mister says
2015-11-05 at 21:06 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-836)

I have access to an NVIDIA Grid K2 card on a virtual machine and I have some questions related to this:

1. How does this card rank compared to the other models?


2. More importantly, are there any issues I should be aware of when using this card
or just doing deep learning on a virtual machine in general?

I do not have the option of using any other machine than the one provided.


Mister says
2015-11-05 at 21:30 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-837)

And of course, thanks for some great articles! They are a big help.


Tim Dettmers says


2015-11-06 at 18:45 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-844)

You are welcome! I am glad that it helped!


Tim Dettmers says


2015-11-06 at 18:44 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-843)

1. The Grid K2 card will perform roughly as well as a GTX 680, although its PCIe connection might be crippled due to virtualization.
2. It depends highly on the hardware/software setup. Generally there should not be any issues other than problems with parallelism.


Mister says
2015-11-27 at 03:28 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-1029)

Do you know what versions of CUDA it is compatible with? Would it work with CUDA 7.5?


Alex (http://olorin.ru) says


2015-11-30 at 20:52 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-1047)

Hi! Fantastic article. Are there any on-demand solutions such as Amazon but with a 980 Ti on board? I can’t find any.


Tim Dettmers says


2015-12-01 at 15:00 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-1051)

Amazon needs to use special GPUs which are virtualizable. Currently the best cards with this capability are Kepler cards, which are similar to the GTX 680. However, other vendors might have GPU servers for rent with better GPUs (as they do not use virtualization), but these servers are often quite expensive.


Marc-Philippe Huget says

2015-12-14 at 10:43 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-1152)

Hello Tim,

First of all, I stumbled on your blog when looking for a deep learning configuration, and I loved your posts, which confirm my thoughts.

I have two questions, if you have time to answer them:

(1) For specific problems, I will train my DNN on ImageNet plus some other classes; for this, I don’t mind waiting for a while (well, a long while) until the DNN is ready. But do you know whether a configuration of one to four Titan X (12GB each) will be fast enough when scene-labelling images? I would like to have answers within seconds, like Clarifai does. I guess this is dependent on the number of hidden layers I could have in my DNN.

(2) Have you used your configuration long enough to provide feedback on the MTBF of GPU cards? I guess that, like disks, running a system on a 24/7 basis will impact the longevity of GPU cards…

Thanks in advance for your answers

mph


Tim Dettmers says


2015-12-15 at 22:03 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-1162)

(1) Yes, this is highly dependent on the network architecture and it is difficult to say more about this. However, this benchmark page (https://github.com/soumith/convnet-benchmarks) by Soumith Chintala might give you some hint of what you can expect from your architecture given a certain depth and size of the data. Regarding parallelization: you usually use LSTMs for labelling scenes, and these can be parallelized easily. However, running image recognition and labelling in tandem is difficult to parallelize. You are highly dependent on the implementations of certain libraries here, because it costs just too much time to implement it yourself. So I recommend making your choice for the number of GPUs dependent on the software package you want to use.
(2) I have had no failures so far, but of course this is for a sample size of 1. I have heard from other people that use multiple GPUs that they had multiple failures in a year, but I think this is rather unusual. If you keep the temperatures below 80 degrees your GPUs should be just fine (theoretically).


Stas says
2016-01-19 at 19:20 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-1563)

Awesome work, this article really clears up the questions I had about the available GPU options for deep learning.

What can you say about the Jetson series, namely the latest TX1? Is it recommended as an alternative to a PC rig with desktop GPUs?


Tim Dettmers says


2016-01-25 at 14:10 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-1652)

I was also thinking about the idea of getting a Jetson TX1 instead of a new laptop, but in the end it is more convenient and more efficient to have a small laptop and ssh into a desktop or an AWS GPU instance. An AWS GPU instance will be quite a bit faster than the Jetson TX1, so the Jetson only makes sense if you really want to do mobile deep learning, or if you want to prototype algorithms for future generations of smartphones that will use the Tegra X1 GPU.


Alexander says
2016-01-21 at 09:31 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-1585)

Hi Tim!

Thank you for the excellent blog post.

I use various neural nets (i.e. sometimes large, sometimes small) and am hesitating between the GTX 970 and GTX 960. Which is better if we set the price factor aside?

– The 970 is ~2x faster than the 960, but as you say it has troubles.
– On the other hand, Nvidia has shown that the GTX 980 has the same memory troubles above 3.5GB:
http://www.pcper.com/news/Graphics-Cards/NVIDIA-Responds-GTX-970-35GB-Memory-Issue
If we take their information for granted, I don’t understand your point regarding memory troubles in the GTX 970 at all, because you do recommend the GTX 980.

Simply put, is the GTX 970 still faster than the GTX 960 on large nets or not? What concrete troubles do we face using the 970 on large nets?

Thank you again, Alexander


Tim Dettmers says


2016-01-25 at 14:15 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-1653)

Hi Alexander,

if you look at the screenshots again, you will see that the bandwidth for the GTX 980 does not drop when we increase the memory. So the GTX 980 does not have memory problems.

Regarding your question of 960 vs. 970: the 970 is much better if you can stay below 3.5GB of memory, but much worse otherwise. If you sometimes train some large nets, but you are not insisting on very good results (rather, you are satisfied with good results), I would go with the GTX 970. If you train something big and hit the 3.5GB barrier, just adjust your neural architecture to be a bit smaller and you should be alright (or you might try different things like 16-bit networks, or aggressive use of 1×1 convolutional kernels (inception) to keep the memory footprint small).
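A back-of-the-envelope estimate helps when deciding whether a net fits below the 3.5GB barrier. A minimal sketch, where the layer sizes are made-up examples and real frameworks add further overhead for gradients and workspace memory:

def conv_layer_bytes(batch, h, w, in_ch, out_ch, k=3, bytes_per_float=4):
    # activations kept for the backward pass, plus the weight tensor
    activations = batch * h * w * out_ch * bytes_per_float
    weights = k * k * in_ch * out_ch * bytes_per_float
    return activations + weights

# e.g. a 128-image batch on a 64x64 feature map with 256 filters
mem = conv_layer_bytes(batch=128, h=64, w=64, in_ch=128, out_ch=256)
print("approx. %.2f GB" % (mem / 1024.0**3))  # ~0.5 GB for this one layer
# with 16-bit floats (bytes_per_float=2) the same layer needs half as much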


Alexander says
2016-01-25 at 21:31 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-1658)

Thanks, Tim!
Indeed, I overlooked the first screenshot; it makes a difference.
I still don’t understand Nvidia’s statement, somehow they equated the GTX 980 and 970 above 3.5GB, but no matter.


Alexander says
2016-02-07 at 13:23 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-1956)

Hey Tim!
I was thinking about the GTX 970 issue again. According to the test, it loses bandwidth above 3.5GB. But what does that mean exactly?


– Does it start affecting bandwidth for memory below 3.5GB as well? (I guess no.)
– Does it decrease GPU computing performance itself? (I guess no.)
– What if the input data is allocated in GPU memory below 3.5GB, and only the CNN weights are allocated above 3.5GB? In that case the upper 0.5GB wouldn’t be used for data exchange and might not affect overall bandwidth? I understand we don’t control this allocation by default, but what about in theory?


Hossein says
2016-02-17 at 15:39 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-2196)

Great post.
I bought a GTX 750; considering your article, I’m doomed, right?
I have a question though. I haven’t tested this yet, but here it goes.
Do you think I can use VGGNet or Alex Krizhevsky’s net for CIFAR10? The GTX 750 has 2GB of GDDR5 RAM. CIFAR10 is only 60K images of size 32*32*3! Maybe it fits?! I’m not sure.
What do you think? Might I be able to feed it PASCAL VOC 2007 as well?

Thanks again


Tim Dettmers says


2016-02-17 at 22:42 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-2198)

The GTX 750 will be a bit slow, but you should still be able to do some deep learning with it. If you are using libraries that support 16-bit convolutional nets, then you should be able to train AlexNet even on ImageNet, so CIFAR10 should not be a problem. Using VGG on CIFAR10 should work out, but it might be a bit tight, especially if you use 32-bit networks. I have no experience with the PASCAL VOC 2007 dataset, but the image sizes seem to be similar to ImageNet; thus AlexNet should work out, but probably not VGG, even with 16 bits.
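The arithmetic behind “CIFAR10 should not be a problem” is quick to check; a small sketch counting only raw pixel storage (the network’s own activations and weights come on top of this):

n_images, h, w, c = 60000, 32, 32, 3
raw_bytes = n_images * h * w * c                            # uint8 pixels
print("raw dataset: %.0f MB" % (raw_bytes / 1024.0**2))     # ~176 MB
print("as float32: %.2f GB" % (raw_bytes * 4 / 1024.0**3))  # ~0.69 GB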


Hossein says
2016-03-09 at 05:51 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-2499)

Thank you very much.

By the way, I’m using Caffe, and I guess it only supports 32-bit convnets. I’m already hitting my limit using a 4-conv-layer network (1991MB or so), and overall only 2~3MB of GPU memory remains.
Your article and help were of great help to me, sir, and I thank you from the bottom of my heart.
God bless you,
Hossein


Pawel Kozela says


2016-02-28 at 02:37 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-2331)

Hi Tim,

thanks for the great guidelines!

In case somebody’s interested in the numbers: I’ve just bought a GTX 960 (http://www.gigabyte.com/products/product-page.aspx?pid=5400#ov) and I’m getting ~50% better performance than an AWS G2.2 instance (Keras / TensorFlow backend).


Tim Dettmers says

2016-02-28 at 11:34 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-2334)

Thank you, Pawel. Those are very useful statistics!

Wajahat says
2016-03-07 at 14:11 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-2467)

Hi Tim
Thanks a lot for this article. I was looking for something like this.
I have a quick question: what would be the expected speedup for ConvNets with a GTX Titan X vs. a Core i7 4770 at 3.4 GHz? A rough idea would do the job.

Best Regards
Wajahat


Hossein says

2016-03-09 at 14:08 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-2504)

I wonder what exactly happens when we exceed the 3.5GB limit of the GTX 970?!
Will it crash? If not, how much slower does it get when it passes that limit?
I want to know: if it passes the limit and gets slower, would it still be faster than the GTX 960? If so, that would be great.
Has anyone ever observed or benchmarked this? Have you?
Thanks again


Hossein says
2016-03-09 at 14:12 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-2505)

Thanks a lot. Actually I don’t want to play games with this card; I need its bandwidth and its memory to run some applications (a deep learning framework called Caffe).
Currently I have a GTX 750 2GB GDDR5; I need 4GB at least, and at the very same time I also need a higher-bandwidth card.
I can’t buy the GTX 980, it’s too expensive for me, so I was undecided between the GTX 960 4GB and the GTX 970 4GB (3.5GB).
Basically, the GTX 960 is 128-bit and gives me 112GB/s of bandwidth, while the GTX 970 is 256-bit and gives me 192+GB/s of bandwidth.
My current card’s bandwidth is only 80!
So I just need to know: do I have access to the whole 4 gigabytes of VRAM, gaming aside?
Does it crash if it exceeds the 3.5GB limit, or does it just get slower?
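The bandwidth figures quoted here follow the usual GDDR5 rule of thumb; a sketch, where the 7 Gbps effective memory clock is an approximate reference spec and real cards vary:

def bandwidth_gb_per_s(effective_clock_gbps, bus_width_bits):
    # each pin transfers effective_clock_gbps gigabits per second
    return effective_clock_gbps * bus_width_bits / 8.0

print(bandwidth_gb_per_s(7.0, 128))  # ~112 GB/s, in line with a GTX 960
print(bandwidth_gb_per_s(7.0, 256))  # ~224 GB/s peak, in line with a GTX 970/980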


Hossein says
2016-03-09 at 14:14 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-2506)

I mistakenly posted this here! (This was supposed to be on TechPowerUp!)


Cheer says
2016-03-15 at 19:47 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-2635)

I am using a GTX 970 and two 750s (with 1GB and 2GB GDDR5), but there is no big difference in speed.
Rather, it seems the 750 is slightly faster than the 970.
Could you tell me the reason?
Thanks.


Tim Dettmers says

2016-03-18 at 13:17 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-2701)

Hmm, this seems strange. It might be that the GTX 970 hits the memory limit and thus runs more slowly, so that it gets overtaken by a GTX 750. On what kind of task have you tested this?


Vu Pham says
2016-03-17 at 11:53 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-2680)

Hi Tim, do you know (or recommend) any good DL project for “graphics card testing” on GitHub? Recently I’ve been cooperating with a hardware retailer, so they lent me a bunch of NVIDIA graphics cards (Titan, Titan X, Titan Black, Titan Z, 980 Ti, 980, 970, 780 Ti, 780…).


Steve says
2016-04-04 at 16:01 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-3030)

What if I install a single GTX 960 in a PCIe 2.0 slot instead of a 3.0 one?


Tim Dettmers says


2016-04-05 at 20:07 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-3045)

It will be a bit slower to transfer data to the GPU, but for deep learning this is
negligible. So not really a problem.
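A quick calculation shows why the difference is negligible: the per-batch transfer time is tiny either way. A sketch, assuming theoretical x16 peak rates and an example 100MB mini-batch:

batch_gb = 0.1  # an assumed 100MB mini-batch
for name, gb_per_s in [("PCIe 2.0 x16", 8.0), ("PCIe 3.0 x16", 15.8)]:
    print("%s: %.1f ms per batch" % (name, batch_gb / gb_per_s * 1000))
# a few milliseconds either way, usually hidden behind GPU compute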


Amir H. Jadidinejad says


2016-04-07 at 01:41 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-3063)

Thank you for sharing this. Please update the list with the new Tesla P100 and compare it with the Titan X.


Tim Dettmers says

2016-04-07 at 19:25 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-3077)

I will probably do this on the weekend.

Wajahat says
2016-04-14 at 13:05 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-3211)

Hi Tim
Thanks a lot for sharing such valuable information.
Do you know if it would be possible to use an external GPU enclosure for deep learning, such as a Razer Core?
http://www.anandtech.com/show/10137/razer-core-thunderbolt-3-egfx-chassis-499399-amd-nvidia-shipping-in-april

Would there be any compromise on efficiency?

Best Regards
Wajahat


Tim Dettmers says


2016-04-24 at 07:54 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-3448)

There will be a penalty getting the data from your CPU to your GPU, but the performance on the GPU will not be impacted. Depending on the software and the network you are training, you can expect a 0-15% decrease in performance. This should still be better than the performance you could get from a good laptop GPU.


Richard says
2016-04-28 at 15:28 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-3558)

What can I expect from a Quadro M2000M (see http://www.notebookcheck.net/NVIDIA-Quadro-M2000M.151581.0.html) with 4GB RAM in an “I started deep learning and I am serious about it” situation?


Tim Dettmers says


2016-05-08 at 14:57 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-3967)

It will be comparable to a GTX 960.


Haider says
2016-05-07 at 07:49 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-3915)

Hi

I keep coming back to this great article. I was about to buy a 980 Ti when I discovered that Nvidia just announced the Pascal GTX 1080, to be released at the end of May 2016. Maybe you want to update your article with the fantastic performance and price of the GTX 1080/1070.


Tim Dettmers says


2016-05-08 at 15:04 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-3971)

I will update the blog post soon. I want to wait until some reliable performance
statistics are available.


Haider says
2016-05-07 at 07:57 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-3916)

By the way, the price difference between ASUS, EVGA, etc. vs. the original Nvidia seems pretty high. The Titan X on Amazon is priced around 1300 to 1400 USD vs. 999 USD in the Nvidia online store. Do you advise against buying the original Nvidia? If yes, why? What is the difference? Which brand do you prefer?

Many thanks Tim. Your posts are unique. We badly need hardware posts for deep
learning!


Tim Dettmers says


2016-05-08 at 15:06 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-3972)

For deep learning the performance of the NVIDIA one will be almost the same as ASUS, EVGA etc. (probably about a 0-3% difference in performance). Brands like EVGA might also add something like a dual-boot BIOS for the card, but otherwise it is the same chip. So definitely go for the NVIDIA one.


Haider says
2016-07-29 at 21:06 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-7063)

I read this interesting discussion about the differences in reliability, heat issues, and future hardware failures between reference-design cards and OEM-design cards:
https://hashcat.net/forum/thread-4386.html

The opinion was strongly against buying the OEM-design cards, especially for computing and 24/7 operation of GPUs.

I read all 3 pages, and it seems there is no citation or any scientific study backing up the opinion, but the author seems to have first-hand experience, having bought thousands of Nvidia cards before.

So what is your comment about this? Should we avoid OEM-design cards and stick with the original Nvidia reference cards?


Hayder Hussein says


2017-01-31 at 02:06 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-11950)

Answering my own question above:

I asked the same question of the author of this blog post (Matt Bach) of Puget Systems, and he was kind enough to answer based on the roughly 4,000 Nvidia cards that they have installed at his company:
https://www.pugetsystems.com/labs/articles/Most-Reliable-PC-Hardware-of-2016-872/

I will quote the discussion that happened in the comments of the above article, in case anybody is interested:

Matt Bach:
Interesting question, and one that is a bit hard to answer since we don’t really track individual cards by usage. I will tell you, however, that we lean towards reference cards if the card is expected to be put under a heavy load or if multiple cards will be in a system. Many of the 3rd-party designs like the EVGA ACX and ASUS STRIX series don’t have very good rear exhaust, so the air tends to stay in the system and you have to vent it with the chassis fans. That is fine for a single card, but as soon as you stack multiple cards into a system it can produce a lot of heat that is hard to get rid of. The Linus video John posted in reply to your comment lines up pretty closely with what we have seen in our testing.

I did go ahead and pull some failure numbers from the last two years. This is looking at all the reference cards we sold (EVGA, ASUS, and PNY mostly) versus the EVGA ACX and ASUS STRIX cards (which are the only non-reference cards we tend to sell):

Total failures: Reference 1.8%, EVGA ACX 5.0%, ASUS STRIX 6.6%
DOA/shop failures: Reference 1.0%, EVGA ACX 3.9%, ASUS STRIX 1.5%
Field failures: Reference 0.7%, EVGA ACX 1.1%, ASUS STRIX 3.4%

Again, we don’t know the specific usage for each card, but this is looking at about 4,000 cards in total, so it should average out pretty well. If anything, since we prefer to use the reference cards in 24/7 compute situations, this is making the reference cards look worse than they actually are. The most telling is probably the field failure rate, since that is where the cards fail over time. In that case, the reference cards are only a bit better than the EVGA ACX, but quite a bit better than the ASUS STRIX cards.

Overall, I would definitely advise using the reference-style cards for anything that is under heavy load. We find them to work more reliably both out of the box and over time, and the fact that they exhaust out the rear really helps keep them cooler, especially when you have more than one card.

Hayder Hussein:
Recently Nvidia began selling their own cards themselves (at a bit higher price). What would be your preference: the cards that Nvidia manufactures and sells themselves, or third-party reference-design cards like EVGA or ASUS?

Matt Bach:
As far as I know, NVIDIA is only selling their own version of the Titan X Pascal card. I think that was just because supply of the GPU core or memory is so tight that they couldn’t supply all the different manufacturers, so they decided to sell it directly. I believe the goal is to get it to the different manufacturers eventually, but who knows when/if that will happen.

If they start doing that for the other models too, there really shouldn’t be much of a difference between an NVIDIA-branded card and a reference ASUS/EVGA/whatever. It is really hard to know if NVIDIA would have different reliability than other brands, but my gut instinct is that the difference would be minimal.


Tim Dettmers says


2017-02-01 at 15:10
(http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-11983)

That is really insightful, thank you for your comment!

Haris Jabbar says


2016-05-13 at 07:43 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-4179)


Your blog posts have become a must-read for anyone starting on deep learning with
GPUs. Very well written, especially for newbies.

I was wondering, though, if/when you will write about the new beast: the GTX 1080? I am thinking of putting together a multi-GPU workstation with these cards. If you could compare the 1080 with the Titan or 900-series cards, that would be super useful for me (and I am sure quite a few other folks).


Thomas R says
2016-05-19 at 10:46 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-4560)

Thank you for this great article. What is your opinion of the new Pascal GPUs? How would you rank the GTX 1080 and GTX 1070 compared to the GTX Titan X? Is it better to buy the newer GTX 1080, or to buy a Titan X, which has more memory?


Tim Dettmers says

2016-05-26 at 11:11 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-4827)

Both cards are better. I do not have any hard data on this yet, but it seems that the GTX 1080 is just better, especially if you use 16-bit data.


Amit says
2016-05-19 at 12:46 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-4564)

Hey,
Great writeup.

I have a GTX 970M with an i7 6700 (desktop CPU) in a Clevo laptop.

How good is the GTX 970M for deep learning?


Tim Dettmers says


2016-05-26 at 11:10 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-4826)

A GTX 970M is pretty okay; especially the 6GB variant will be enough to explore deep learning and fit some good models on data. However, you will not be able to fit state-of-the-art models, or medium-sized models in good time.


Ricardo says
2016-05-21 at 08:58 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-4633)

Great article! I would love to see some benchmarks on actual deep learning tasks.

I was under the impression that single precision could potentially result in large
errors. In large networks with small weights/gradients, won’t the limited precision
propagate through the net causing a snowballing effect?

I admit I have not experimented with this, or tried calculating it, but this is what I
think. I’ve been trying to get my hands on a Titan / Titan Black, but with what you
suggest, it would be much better getting the new Pascal cards.

With that being said, how would ‘half precision’ do with deep learning then?


Tim Dettmers says

2016-05-23 at 09:48 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-4709)

The problem with actual deep learning benchmarks is that you need the actual hardware, and I do not have all these GPUs.

Working with low precision is just fine. The error is not high enough to cause problems. It has even been shown that this is true for using single bits instead of floats, since stochastic gradient descent only needs to minimize the expectation of the log likelihood, not the log likelihood of mini-batches.

Yes, Pascal will be better than the Titan or Titan Black. Half precision will double performance on Pascal since half-float computations are supported. This is not true for Kepler or Maxwell, where you can store 16-bit floats but not compute with them (you need to cast them into 32 bits).
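A minimal numpy sketch of this “16-bit storage, 32-bit compute” pattern (the matrix sizes are arbitrary examples):

import numpy as np

# weights stored in float16 to halve memory
W16 = np.random.randn(4096, 4096).astype(np.float16)
x = np.random.randn(4096).astype(np.float32)

# what pre-Pascal cards effectively require: cast up, then compute in 32-bit
y = np.dot(W16.astype(np.float32), x)
print(y.dtype, "| stored weights: %.0f MB" % (W16.nbytes / 1024.0**2))  # 32 MB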


Rishikesh says
2016-05-26 at 11:43 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-4829)

What about the new Nvidia GPUs like the GTX 1080 and GTX 1070? Please review these, once they are released, from the perspective of deep learning. Nvidia claims that the GTX 1080’s performance beats the GTX Titan; is this true for deep learning tasks?
I am about to buy a new GPU for deep learning, so please suggest which GPU I should buy for the best budget-vs-performance ratio.


mariano (http://stackoverflow.com/questions/37600808/how-to-install-tensorflow-from-source-with-unofficial-gpus-support-in-ubuntu-16) says
2016-06-02 at 23:29 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-5094)

I was able to use TensorFlow, the latest Google machine learning framework, with an NVidia GTX 960 on Ubuntu 16.04. It’s not officially supported but it can be used.
I’ve posted a tutorial about how to install it here:
http://stackoverflow.com/questions/37600808/how-to-install-tensorflow-from-source-with-unofficial-gpus-support-in-ubuntu-16


Suriya says
2016-06-06 at 15:34 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-5258)

Hi,
Very nice post! I found it really useful, and I felt the GeForce 980 suggestion for Kaggle competitions was really apt. However, I am wondering how good the mobile versions of the GeForce series, such as the 940M, 960M, 980M and so on, are for Kaggle. Any thoughts on this?


Tim Dettmers says


2016-06-11 at 16:14 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-5505)

I think for Kaggle anything >=6GB of memory will do just fine. If you have a slower 6GB card then you have to wait longer, but it is still much faster than a laptop CPU, and although slower than a desktop you still get a nice speedup and a good deep learning experience. Getting one of the fast cards is, however, often a money issue, as laptops that have them are exceptionally expensive. So a laptop card is good for tinkering and getting some good results in Kaggle competitions. However, if you really want to win a deep learning Kaggle competition, computational power is often very important, and then only the high-end desktop cards will do.


Harvey says
2016-06-21 at 00:38 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-5844)

Tim, what is the InfiniBand 40Gbit/s interconnect card for? Do I absolutely need the card if I am going to do a multi-GPU setup? And are all three of your Titan X cards connected using SLI?


Tim Dettmers says


2016-06-23 at 15:53 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-5939)

You only need InfiniBand if you want to connect multiple computers. For multiple GPUs you just need multiple PCIe slots and a CPU that supports enough lanes. SLI is only used for games, not for CUDA compute.


Niko Bertrand says


2016-06-28 at 14:37 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-6133)

Hi Tim, thanks for updating the article! Your blog has helped me a lot in increasing my understanding of machine learning and the technologies behind it.

Up to now I have mostly used AWS or Azure for my computations, but I am planning a new PC build. Unfortunately I still have some unanswered questions where even the mighty Google could not help!

Anyways, I was wondering whether you use all your GPU(s) for monitor output as well? I read a lot about screen tearing / blank screens / X stopping for a few seconds while running the algorithms.
A possible solution would be to get a dedicated GPU for display output, such as a GTX 950 running at x8 to connect 3 monitors, while having 2 GTX 1080s at x16 speed just for computation. What is your opinion / experience regarding this matter?

Furthermore, my current PC’s CPU only has 16 PCIe lanes but has an iGPU built in. Could I use the iGPU for graphics output while a 1080 is built in for computation? I found a thread on Quora, but the only feedback given was to get a CPU with 40 PCIe lanes. Of course this is true, but cash does not grow on trees, and AMD Zen and the new Skylake Extreme chipsets are on the horizon.

Your feedback is highly appreciated and thanks in advance!


Tim Dettmers says


2016-06-28 at 15:16 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-6136)

Personally, I never had any problems with video output from GPUs on which I also do computation.

The integrated iGPU is independent of your dedicated GTX 1080 and does not eat up any lanes. So you can easily run graphics from your iGPU and compute with 16 lanes from your GTX 1080.

The only problems you might encounter are with the Intel/NVIDIA driver combination. I would not care too much about the performance reduction (0-5%), and I have yet to see problems with using a GPU for both graphics and compute at the same time; I have worked with about 5-6 different setups.

So I would try the iGPU + GTX 1080 setup, and if you run into errors just use the GTX 1080 for everything.


Swapnil says
2016-06-30 at 01:23 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-6201)

Hi, I am trying to find a Kepler card (CC >= 3.0/3.5) for my research. Could you please suggest one? I had a GeForce GTX Titan X earlier, but I could not use it for dual purposes, i.e. for computation and the display driver; the Titan X does not allow this. So I’m searching for a Kepler card which allows dual use. Kindly suggest one.


Tim Dettmers says


2016-06-30 at 20:59 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-6241)


Try to recheck your configuration. I have been running deep learning and a display driver on a GTX Titan X for quite some time and it is running just fine.


Niko Bertrand says


2016-06-30 at 20:19 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-6239)

Thanks for the quick reply!

One final question, which may sound completely stupid:

Reference cards vs. custom GPUs?
Often the clock speed and sometimes the VRAM speed are overclocked by default on many non-reference cards, but if I look through builds and their included screenshots, it seems that mostly reference cards are used. The only reason I could think of is the “predictable” cooler height.

Thanks again!


Niko Bertrand says


2016-06-30 at 20:23 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-6240)

Oops, I meant Founders Edition vs. reference cards… happens when the mind wanders elsewhere!


Haider says
2016-07-03 at 17:42 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-6351)

Thanks Tim. Great update to this article, as usual.

I was eager to see any info on the support for half-precision (16-bit) processing in the GTX 1080. Some articles were speculating, a few days before release, that it might be deactivated by Nvidia to reserve this feature for the future Nvidia P100 Pascal cards. However, around one month after the release of the GTX 1000 series, nobody seems to mention anything related to this important feature. This alone, if enabled in the GTX cards, would give up to ~1.5 to 2x more TFlops/s of processing power compared to Maxwell GPUs, including the Titan X. And as you mentioned, it adds the bonus of reducing memory requirements by up to half. However, it is still not clear whether the accuracy of the NN will be the same as with single precision, and whether we can use half precision for all the parameters. These are of course important for estimating the speedup and the memory savings for a given task.



Joshua Stanton says


2016-07-05 at 11:53 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-6419)


Hey there!

Great article! I am currently trying to flickr-train GoogLeNet on 400 of my own images using my SLI 780 Tis, but I keep getting errors such as “cannot find file”, pointing to the location of one of the images I am training on, even though the file is there and the correct path is in the train file. Do you have any idea why this would be? Also, in the guide I followed, the author had 4GB of VRAM and used a batch size of 40 with 256×256 images; I did the same but with a batch size of 30 to account for my 3GB of VRAM. Am I doing something wrong here? How can I optimize the training to work on my video card? I appreciate any help you can give! Thanks, Josh!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=6419#respond)

Rudra says
2016-07-15 at 17:23 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-6696)

Hi Tim,

Thanks for a great article, it helped a lot.


I am a beginner in the field of deep learning and have built and used only a couple of architectures on my CPU (I am currently a student, so I decided not to invest in GPUs right away).

I have a question regarding the amount of CUDA programming required if I decide to do some sort of research in this field. I have mostly implemented my vanilla models in Keras, and I am learning Lasagne so that I can come up with novel architectures.


Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=6696#respond)

Tim Dettmers says


2016-07-19 at 07:22 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-6784)

I know quite a few researchers whose CUDA skills are not the best. You often need CUDA skills for efficient implementations of novel procedures or to optimize the flow of operations in existing architectures, but if you want to come up with novel architectures and can live with a slight performance loss, then no or very little CUDA skill is required. Sometimes you will have cases where you cannot progress due to lacking CUDA skills, but this is rarely an issue. So do not waste your time on CUDA!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=6784#respond)

Rudra says
2016-07-21 at 10:03 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-6851)

Thanks for the reply

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=6851#respond)


Ken says
2016-07-17 at 15:56 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-6739)

Is a GT 635 capable of cuDNN (conv_dnn) acceleration? In theory it is a GK208 Kepler chip with 2GB of memory. I know it is a crap card, but it is the only Nvidia card I had lying around. I have not been able to get GPU acceleration to work on Windows 8.1, so I wanted to ask whether it is my Theano/CUDA/Keras installation that is the issue, or the card, before I throw any money at the problem and buy a better GPU (960+). Should I go to Windows 10?

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=6739#respond)

Tim Dettmers says


2016-07-19 at 07:19 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-6783)

Your card, although crappy, is a Kepler card and should work just fine. Windows could be the issue here. Often it is not well supported by deep learning frameworks. You could try CNTK, which has better Windows support (https://github.com/Microsoft/CNTK/wiki/Setup-CNTK-on-Windows). If you try CNTK, it is important that you follow the install tutorial step-by-step from top to bottom.

I would not recommend Windows for doing deep learning, as you will often run into problems. I would encourage you to try to switch to Ubuntu. Although the experience is not as great when you make the switch, you will soon find that it is much superior for deep learning.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=6783#respond)

Ken says
2016-07-20 at 17:40 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-6822)

Thanks Tim, I did eventually get the GT 635 working under Windows 8.1 on my Dell: about a 2.7x improvement over my Mac Pro's 6-core Xeons. Getting things going on OS X was much easier. I still don't think the GT 635 is using cuDNN (cuDNN not available), but I'll have to play around; I get the sense I could get another 2x with it. The 2GB of VRAM sucks; I really have to limit the batch sizes I can work with.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=6822#respond)

Ken says
2016-07-20 at 17:20 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-6820)

What are your thoughts on the GTX 1060? An easy replacement for the 960 on the low end? Overclocking the 1060 looks like it can get close to a FE 1070, minus 2GB of memory. Thoughts?


Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=6820#respond)

George says
2016-07-28 at 22:14 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7042)

Hello Tim,

I’ve been following your blog for 3 months now, and since then I have been waiting to buy a GTX Titan X Pascal. However, there are rumors about Nvidia releasing the Volta architecture next year with HBM2. What are your thoughts on investing in a Pascal-architecture GPU right now? Thank you.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7042#respond)

Tim Dettmers says


2016-08-04 at 06:40 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7208)

If you already have a good Maxwell GPU, the wait for Volta might well be worth it. However, this of course depends on your applications, and you can always sell your Pascal GPU once Volta hits the market. Both options have their pros and cons.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7208#respond)


Alex telitsine says


2016-07-29 at 06:15 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7047)

I’m curious whether unified memory in CUDA 8 will work for dual GTX 1080s. Then, theoretically, a dual-1080 NVLink setup would crush a Titan X in memory and ops?

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7047#respond)

Tim Dettmers says


2016-08-04 at 06:42 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7209)

Unified memory is more a theoretical than a practical concept right now. CPU and GPU memory are still managed by the same mechanisms as before; it is just that the transfers are hidden. Currently you will not see any benefit from this over Maxwell GPUs.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7209#respond)

Raj says
2017-07-01 at 00:27 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-16677)


I have a used 1060 6GB on hand. I am planning to get into research-type deep learning. However, I am still getting started and don't understand all the nitty-gritty of parameter tuning, batch sizes, etc. You mention 6GB would be limiting in deep learning. If I understand right, using small batch sizes would not converge on large models like ResNet with a 1060 (I am shooting in the dark here with respect to terminology, since I am still a beginner). I was thus looking to get a 1080 Ti when I came across an improved form of “unified memory” introduced with Pascal and CUDA 6+. I have a 6-core HT Xeon CPU + 32GB RAM. Could I use some system RAM to remove the 6GB limitation?

I understand that the P100 can do this without incurring heavy copy latency, due to a new “page migration engine” feature wherein, if an access to data in GPU memory leads to a page fault, the program suspends the call and requests the relevant memory page (instead of the whole data) from the CPU.
http://parallelplusplus.blogspot.in/2016/09/nvidia-pascals-gpu-architecture-most.html (http://parallelplusplus.blogspot.in/2016/09/nvidia-pascals-gpu-architecture-most.html)
There is confusion about this feature in the 1080 Ti, even though it uses the same gp100 module. The conclusion is that it can check for “page fault” but not do “prefetch”.
https://stackoverflow.com/a/43770299 (https://stackoverflow.com/a/43770299)

and is seconded by
https://stackoverflow.com/a/40011988 (https://stackoverflow.com/a/40011988) for the Titan Xp.

Since the 1080 (and by inference the 1060 6GB, since they both have gp400) also has ConcurrentManagedAccess set to 1, according to
https://devtalk.nvidia.com/default/topic/1015688/on-demand-paging/?offset=5 (https://devtalk.nvidia.com/default/topic/1015688/on-demand-paging/?offset=5)
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-device-properties (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-device-properties)

I am guessing I wouldn't benefit much from a new purchase of a 1080 Ti.

However, it looks like these CUDA 8 features in C++ haven't yet been applied to higher-level language libraries.
https://github.com/inducer/pycuda/issues/116 (https://github.com/inducer/pycuda/issues/116)

It looks like AMD's Vega might have this feature too, through IOMMUv2 hardware passthrough, and considering that AMD's MIOpen 1.0 supports TensorFlow and Torch7, it might be a good alternative.
http://parallelplusplus.blogspot.in/2017/01/fine-grained-memory-management-on-amd.html (http://parallelplusplus.blogspot.in/2017/01/fine-grained-memory-management-on-amd.html)
http://www.amd.com/system/files/2017-06/TIRIAS-AMD-Epyc-GPU-Server.pdf (http://www.amd.com/system/files/2017-06/TIRIAS-AMD-Epyc-GPU-Server.pdf)
https://www.reddit.com/r/Amd/comments/61oolv/vega_you_are_no_longer_limited_by_the_amount_of/ (https://www.reddit.com/r/Amd/comments/61oolv/vega_you_are_no_longer_limited_by_the_amount_of/)

I would be interested in hearing your thoughts, specifically on whether a 1060 would be capable of addressing more than 6GB without a heavy penalty, maybe in the future (I can't extrapolate what the yes to “page fault” and no to “prefetch” would mean in this context), and whether it would be faster or slower than the AMD Vega solution using the IOMMU.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=16677#respond)

Tim Dettmers says


2017-07-05 at 00:56 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-16822)

I think the easiest and often overlooked option is just to switch to 16-bit models, which doubles your memory. This is supported by most libraries (you are right that the “page migration engine” is not supported by any deep learning library). Other than that, I think one could always adjust the network to make it work on 6GB; with this you will not be able to achieve state-of-the-art results, but it will be close enough, and you save yourself a lot of hassle. I think this also makes the most sense practically.
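
As a minimal sketch of such a 16-bit switch, assuming a Keras setup (the layer sizes here are placeholders, not a recommendation):

# Sketch: storing a Keras model in float16 to roughly halve GPU memory.
# set_floatx('float16') makes tensors created from here on 16-bit; the
# backend may still promote some computations to float32 internally.
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

K.set_floatx('float16')  # 16-bit storage for weights and activations
K.set_epsilon(1e-4)      # larger epsilon helps avoid float16 underflow

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')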

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=16822#respond)

Anonymous (https://excellence2016.wordpress.com/) says


2016-07-31 at 20:56 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7126)


My brother recommended that I would possibly like this blog. He was totally right: this post actually made my day. You cannot believe just how much time I had spent looking for this info! Thanks!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7126#respond)

Tim Dettmers says


2016-08-04 at 06:46 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7211)

I am glad to hear that you and your brother found my blog post helpful. Thank you!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7211#respond)

vikram says
2016-08-02 at 10:47 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7174)

Hi,

My confusions are:
1) Is the Quadro series (K2000 and higher) capable enough for beginning deep learning?
2) Kepler, Maxwell, Pascal: how much difference does the architecture make on performance for a beginner?
3) Is the GTX Titan X Pascal or Maxwell?
4) Which parameters should be considered when comparing GPUs for deep learning?
5) Please suggest a GPU.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7174#respond)

Tim Dettmers says


2016-08-04 at 06:49 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7213)

1) A K2000 will be okay and you can do some tests, but its memory and performance will not be sufficient for larger datasets.
2) Get a Maxwell or Pascal GPU if you have the money; Kepler is slow.
3) There is one Titan X for Pascal and one for Maxwell.
4) Look at memory bandwidth mostly.
5) GTX 1060.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7213#respond)

Vikram says
2016-08-04 at 06:54 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7215)


Exceptionally excellent blog.

Thank you so much for your valuable reply.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7215#respond)

Ondrej says
2016-08-04 at 14:48 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7222)

Hi Tim,

first of all, thank you for your awesome articles about deep learning; they have been very useful to me. Since I use the Caffe and CNTK frameworks for deep learning and GPU computing speed is very important to me, encouraged by your last article update (GTX Titan X Pascal = 0.7, GTX 1080 = 0.5, GTX 980 Ti) and very positive reviews on the Internet, I decided to upgrade my GTX 980 Ti (Maxwell) to a brand-new GTX 1080 (Pascal). In order to compare the performance of both architectures, new Pascal against old Maxwell (and of course because I just wanted to see how well my new GTX 1080 performs, to justify the expense), I benchmarked both cards in Caffe (CNTK is not cuDNN 5 ready yet). To my big surprise, the new GTX 1080 is about 20% slower in AlexNet training than the old GTX 980 Ti. I ran two benchmarks in order to compare performance in different operating systems, but with practically the same results. The reason why I chose different versions of CUDA and cuDNN is that the Pascal architecture is supported only from CUDA 8RC and cuDNN 5.0 on, while the Maxwell architecture performs better with CUDA 7.5 and cuDNN 4.0 (otherwise you get poor performance).
Maybe I have done something wrong in my benchmark (but I am not aware of anything...). Could you give me some advice on how to improve training performance on the GTX 1080 with Caffe? Is there any other framework which supports the Pascal architecture at full speed?

First benchmark:
OS: Windows 7 64-bit
Nvidia drivers: 368.81
Caffe build for GTX 1080: Visual Studio 2013 64-bit, CUDA 8RC, cuDNN 5.0
Caffe build for GTX 980 Ti: Visual Studio 2013 64-bit, CUDA 7.5, cuDNN 4.0
Caffe build for GTX 980 Ti: Visual Studio 2013 64-bit, CUDA 7.5, cuDNN 5.0
Caffe build for GTX 980 Ti: Visual Studio 2013 64-bit, CUDA 7.5, cuDNN 5.1
GTX 1080 performance: 4512 samples/sec
GTX 980 Ti performance: 5407 samples/sec (cuDNN 4.0, best performance)
GTX 980 Ti performance: 4305 samples/sec (cuDNN 5.0)
GTX 980 Ti performance: 4364 samples/sec (cuDNN 5.1)

Second benchmark:
OS: Ubuntu 16.04.1
Nvidia drivers: 367.35
Caffe build for GTX 1080: gcc 5.4.0, CUDA 8RC, cuDNN 5.0
Caffe build for GTX 980 Ti: gcc 5.4.0, CUDA 7.5, cuDNN 4.0
GTX 1080 performance: 4563 samples/sec
GTX 980 Ti performance: 5385 samples/sec

Thank you very much,


Ondrej

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7222#respond)


Tim Dettmers says


2016-08-05 at 05:27 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7234)

Thank you, Ondrej, for sharing; these are some very insightful results!

I am not entirely sure how convolutional algorithm selection works in Caffe, but this might be the main reason for the performance discrepancy. The cards might perform better for certain kernel sizes and certain convolutional algorithms. But all in all these are quite hard numbers, and there is little room for arguing. I think I need to update my blog post with some new numbers. Learning that the performance of Maxwell cards is so much better with cuDNN 4.0 is also very valuable. I will definitely add this in an update to the blog post.

Thanks again for sharing all this information!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7234#respond)

stoo says
2016-08-07 at 18:29 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7287)

The best article about choosing GPUs for deep learning I've ever read!
As a CNN learner on a small budget, I decided to buy a GTX 1060 to replace my old Quadro K620. Since the GTX 1060 does not support SLI and you wrote that GPUs use “the PCIe 3.0 interface for communication in multi-GPU applications”, I am a little worried about upgrading later. Should I buy a GTX 1070 instead of a GTX 1060? Thanks.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7287#respond)

Tim Dettmers says


2016-08-08 at 06:44 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7294)

Maybe this was a bit confusing, but you do not need SLI for deep learning applications. The GPUs communicate via the channels that are imprinted on the motherboard. So you can use multiple GTX 1060s in parallel without any problem.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7294#respond)

Pablo Castillo says


2016-08-10 at 16:41 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7339)

Thanks for sharing your knowledge about these topics.

Regards.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7339#respond)


Tharun says
2016-08-11 at 16:34 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7363)

Dear Tim,

Extremely thankful for the info provided in this post.

We have a GPU server on which CUDA 6.0 is installed, and it has two Tesla T10 graphics cards. My question is whether I can use this GPU system for deep learning, as the Tesla T10 is quite old by now. I am facing some hardware issues with installing Caffe on this server. It has Ubuntu 12.04 LTS as its OS.

Thanks in advance
Tharun

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7363#respond)

Tim Dettmers says


2016-08-13 at 21:47 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7405)

The Tesla T10 chip has too low a compute capability, so you will not be able to use cuDNN. Without it you can still run some deep learning libraries, but your options will be limited and training will be slow. You might want to just use your CPU or try to get a better GPU.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7405#respond)


Alex N says
2016-08-11 at 23:05 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7371)

Hi Tim
Your site contains such a wealth of knowledge. Thank you.

I am interested in your opinion on cooling the GPU. I contacted NVIDIA to ask what cooling solutions they would recommend for a GTX Titan X Pascal with regard to deep learning, and they suggested that no additional cooling was required. Furthermore, they would discourage adding any cooling devices (such as an EK water block), as it would void the warranty. What are your thoughts? Is the new Titan Pascal that efficient at cooling? If not, is there a device you would recommend in particular?
Also, I am mostly interested in RNNs, and I plan on starting with just one GPU. Would you recommend a second GPU in light of the new SLI bridge offered by NVIDIA? Do you think it could deliver increased performance on a single experiment?

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7371#respond)

Alex N says

2016-08-12 at 15:14 (http://timdettmers.com/2017/04/09/which-gpu-for-


deep-learning/#comment-7386)

I would also like to add that, looking at the DevBox components, no particular cooling is added except for sufficient GPU spacing and upgraded front fans.
http://developer.download.nvidia.com/assets/cuda/secure/DIGITS/DIGITS_DEVBOX_DESIGN_GUIDE.pdf?autho=1471007267_ccd7e14b5902fa555f7e26e1ff2fe1ee&file=DIGITS_DEVBOX_DESIGN_GUIDE.pdf (http://developer.download.nvidia.com/assets/cuda/secure/DIGITS/DIGITS_DEVBOX_DESIGN_GUIDE.pdf?autho=1471007267_ccd7e14b5902fa555f7e26e1ff2fe1ee&file=DIGITS_DEVBOX_DESIGN_GUIDE.pdf)

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7386#respond)

Tim Dettmers says


2016-08-13 at 21:57 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-7409)

From my experience, additional case fans are negligible (less than a 5 degree difference; often as low as 1-2 degrees). Increasing the GPU fan speed by 1% often has a larger effect than additional case fans.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=7409#respond)

Tim Dettmers says


2016-08-13 at 21:55 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7407)


If you only run a single Titan X Pascal, then you will indeed be fine without any other cooling solution. Sometimes it will be necessary to increase the fan speed to keep the GPU below 80 degrees, but the sound level for that is still bearable. If you use more GPUs, air cooling is still fine, but when the workstation is in the same room, the noise from the fans can become an issue, as can the heat (it is nice in winter: you do not need any additional heating in your room, even if it is freezing outside). If you have multiple GPUs, then moving the server to another room, just cranking up the GPU fans, and accessing the server remotely is often a very practical option. If those options are not for you, water cooling offers a very good solution.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7407#respond)

Chad says
2016-08-24 at 19:00 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7741)

Hi Tim,

Thanks for the great article and thanks for continuing to update it!

Am I correct that the Pascal Titan X doesn't support FP16 computations? So if TensorFlow or Theano (or one's library of choice) starts fully supporting FP16, would the GTX 1080 then be better than the new Titan X, as it would have larger effective (FP16) memory? But perhaps I am missing something...

Is it clear yet whether FP16 will always be sufficient, or might FP32 prove necessary in some cases?

Thanks!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7741#respond)

Tim Dettmers says


2016-08-25 at 06:05 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7752)

Hey Chad, the GTX 1080 also does not support FP16, which is a shame. We will have to wait for Volta for this, I guess. Probably FP16 will be sufficient for most things, since there are already many approaches which work well with lower precision, but we will just have to wait.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7752#respond)

Chad says
2016-08-25 at 17:16 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-7773)

ah, ok. got it. Thanks a lot!


Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=7773#respond)


Mike K says
2016-08-26 at 00:16 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-7779)

Which GTX cards, if any, support int8? Does TensorFlow support int8? Thanks for the great blog.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=7779#respond)

Tim Dettmers says


2016-08-26 at 04:23 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-7783)

All GPUs support int8, both signed and unsigned; in CUDA this is just a signed or unsigned char. I think you can do regular computation just fine. However, I do not know what the support in TensorFlow is like, but in general most deep learning frameworks do not have support for computations on 8-bit tensors. You might have to work closer to the CUDA code to implement a solution, but it is definitely possible. If you work with 8-bit data on the GPU, you can also input 32-bit floats and then cast them to 8 bits in the CUDA kernel; this is what Torch does in its 1-bit quantization routines, for example.
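
As a rough sketch of this cast-and-scale idea (in NumPy for readability; the per-tensor scaling scheme is an illustrative assumption, and the same logic would sit inside a CUDA kernel):

# Quantize 32-bit floats to int8 with one scale per tensor.
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0 + 1e-8  # guard against all-zero input
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(x)
print(np.abs(x - dequantize(q, s)).max())  # small quantization error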

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=7783#respond)


CW says
2016-08-27 at 08:48 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-7836)

Hi Tim,

Would multiple lower-tier GPUs serve better than a single high-tier GPU at a similar cost? E.g.:
3 x 1070
vs.
1 x Titan X Pascal

Which would you recommend?

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7836#respond)

Tim Dettmers says


2016-08-28 at 19:09 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-7877)

Here is one of my Quora answers (https://www.quora.com/Is-the-NVIDIA-Titan-X-better-than-two-GTX-980-for-deep-learning) which deals exactly with this problem. The cards in that example are different, but the same holds for the new cards.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=7877#respond)

CW says
2016-08-30 at 12:49 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-7929)

Thanks for the reply.

I am not sure if I understand the answer correctly: is the bottleneck you are referring to the PCIe bandwidth, which is around 8GB/s when using multiple cards, compared to a single Titan X's memory bandwidth of 336GB/s?
One more question: does slower DDR RAM bandwidth impact the performance of deep learning?

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=7929#respond)

Tim Dettmers says


2016-09-03 at 04:22 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-8035)

That is correct: for multiple cards the bottleneck will be the connection between the cards, which in this case is the PCIe connection. Slower DDR RAM bandwidth decreases performance by almost as much as the bandwidth is lower, so it is quite important. This comparison, however, is not valid between different GPU series, e.g. it is invalid for Maxwell vs. Pascal.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=8035#respond)

Mo says
2016-09-02 at 21:14 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-8026)

Hi Tim. Great article.


One question: I have been given a Quadro M6000 24GB. How do you think it compares to a Titan or Titan X for deep learning (specifically TensorFlow)? I've used a Titan before, and I am hoping that at least it wouldn't be slower.

Thank you.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8026#respond)

Tim Dettmers says


2016-09-03 at 04:28 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-8037)

The Quadro M6000 is an excellent card! I do not recommend it because it is not very cost-efficient. However, the very large memory and the high speed, which is equivalent to a regular GTX Titan X, are quite impressive. On normal cards you do not have more than 12GB of RAM, which means you can train very large models on your M6000. So I would definitely stick with it!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8037#respond)

Mo says
2016-09-04 at 22:23 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-8073)

Awesome, thanks for the quick response.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=8073#respond)

Juan says
2016-09-04 at 16:41 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-8069)

Hey Tim, thank you so much for your article!! I am in the “I started deep learning and I am serious about it” group and will buy a GTX 1060 for it. I am more specifically interested in autonomous vehicles and Simultaneous Localization and Mapping. Your article has helped me clarify my current needs and match them with a GPU and budget.

You have a new follower here!

Thanks!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8069#respond)

Tim Dettmers says


2016-09-04 at 21:40 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-8072)

Thank you for your kind words; I am glad that you found my article helpful!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8072#respond)

Tim says
2016-09-13 at 01:30 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-8291)

Hey Tim,

Thank you for this fantastic article. I have learned a lot these past couple of weeks about how to build a good computer for deep learning.

My question is rather simple, but I have not found an answer yet on the web: should I buy one Titan X Pascal or two GTX 1080s?

Thank you very much for your time,


Tim

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8291#respond)

Tim Dettmers says


2016-09-13 at 06:18 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-8297)

Hey Tim,

In the past I would have recommended one faster, bigger GPU over two smaller, more cost-efficient ones, but I am not so sure anymore. Parallelization in deep learning software gets better and better, and if you do not parallelize your code you can just run two nets at a time. However, if you really want to work on large datasets or memory-intensive domains like video, then a Titan X Pascal might be the way to go. I think it depends highly on the application. If you do not necessarily need the extra memory, that is, you work mostly on applications rather than research and you are using deep learning as a tool to get good results rather than a tool to get the best results, then two GTX 1080s should be better. Otherwise, go for the Titan X Pascal.
Tim

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8297#respond)


Tim says
2017-05-19 at 06:18 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-15093)

Hi Tim,

First of all, thank you for your reply.

I am ready to finally buy my computer; however, I do have a quick question about the 1080 Ti and the Titan Xp. For a researcher that does some GANs, LSTMs and more, would you recommend 2x 1080 Ti or just one Titan Xp? I understand that in your first post you said that the Titan X Pascal should be the one; however, I would like to know if this is still the case with the newer versions of the same graphics cards.

Thank you so much for updating the article!


Tim

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=15093#respond)

Tim Dettmers says


2017-05-21 at 20:13 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-15182)

I think two GTX 1080 Tis would be a better fit for you. It does not sound like you would need to push for the final bit of performance on ImageNet, where a Titan Xp really shines. If you want to build new algorithms on top of tried and tested algorithms like LSTMs and GANs, then two GPUs (which are still very fast) with 1GB less memory each will be far better than one big GPU.


Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=15182#respond)

Eric says
2016-09-13 at 22:59 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-8322)

Tim!

Thanks so much for your article. It was instrumental in my buying the Maxwell Titan X about a year ago. Now I've upgraded to 4 Pascal Titan X cards, but I'm having some problems getting performance to scale using data parallelism.

I'm trying to chunk a batch of images into 4 chunks and classify them (using Caffe) on the 4 cards in parallel, using 4 separate processes.

I've confirmed the processes are running on the separate cards as expected, but performance degrades as I add new cards. For example, if it takes me 0.4 sec/image on 1 card alone, when I run 2 cards in parallel they each take about 0.7 sec/image.

Have you had any experience using multiple Pascal Titan Xs in this manner? Am I just missing something about the setup/driver install?
Thanks!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8322#respond)


Tim Dettmers says


2016-09-15 at 12:20 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-8373)

Were you getting better performance on your Maxwell Titan X? It also depends heavily on your network architecture; what kind of architecture were you using? Data parallelism in convolutional layers should yield good speedups, as do deep recurrent layers in general. However, if you are using data parallelism on fully connected layers, this might lead to the slowdown that you are seeing; in that case the bandwidth between the GPUs is just not high enough.
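
To see why fully connected layers are the worst case here, a back-of-envelope sketch (the layer sizes and the ~8GB/s PCIe figure are illustrative assumptions, not Eric's setup):

# Gradient traffic per synchronization step under data parallelism.
fc_params = 4096 * 4096          # one large fully connected layer
conv_params = 256 * 256 * 3 * 3  # one 3x3 conv layer, 256 -> 256 channels

bytes_per_param = 4              # 32-bit gradients
pcie_bandwidth = 8e9             # ~8 GB/s, e.g. PCIe 3.0 x8

for name, params in [("fc", fc_params), ("conv", conv_params)]:
    seconds = params * bytes_per_param / pcie_bandwidth
    print(name, params, "params:", round(seconds * 1e3, 2), "ms per exchange")

The fully connected layer moves roughly 67MB of gradients per exchange, while the convolutional layer moves only about 2MB, which is why the former saturates PCIe first.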

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8373#respond)

Tim says
2016-09-14 at 17:14 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-8347)

Hi Tim,

Thank you very much for the fast answer.

I just have one more question, related to the CPU. I understand that having more lanes is better when working with multiple GPUs, as the CPU will then have enough bandwidth to sustain them. However, in the case of having just one GPU, is it necessary to have more than 16 or 28 lanes? I was looking at the *Intel Core i7-5930K 3.5GHz 6-Core Processor*, which has 40 lanes (and is the cheapest in that category) but also requires an LGA 2011 socket and DDR4 memory, which are expensive. Is this going to be overkill for the Titan X Pascal?

Thank you for your time!


Tim

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8347#respond)

Tim Dettmers says


2016-09-15 at 12:24 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-8374)

If you have only 1 card, then 16 lanes will be all that you need. If you upgrade to two GPUs, you want either 32+ lanes (16 lanes for each) or to just stick with 16 lanes (8 for each), since the slowest GPU will always drag down the other one (24 lanes means 16x + 8x, and for parallelism this will be bottlenecked by the 8 lanes). Even if you are using 8 lanes, the drop in performance may be negligible for some architectures (recurrent nets with many time steps; convolutional layers) or some parallel algorithms (1-bit quantization, block momentum). So you should be more than fine with 16 or 28 lanes.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8374#respond)


im says
2016-09-28 at 19:30 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-8701)

I compared the Quadro K2200 with the M4000.

Surprisingly, the K2200 beat the M4000 on a simple network.
I am looking for a single-slot GPU with higher performance than the K2200.

How about the K4200? (single-slot, single precision = 2,072.4 GFLOPS)

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8701#respond)

Tim Dettmers says


2016-10-03 at 15:11 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-8817)

Check your benchmarks and whether they are representative of usual deep learning performance. The K2200 should not be faster than an M4000. What kind of simple network were you testing on?

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8817#respond)

im says

2016-10-04 at 03:49 (http://timdettmers.com/2017/04/09/which-gpu-


for-deep-learning/#comment-8827)

I tested the simple network on a Chainer default example, as below:

python examples/mnist/train_mnist.py --gpu 0

Result:
K2200: avg 14288.94418 images/sec
M4000: avg 13617.58361 images/sec

However, I confirmed that the M4000 is faster than the K2200 on a complex network like AlexNet.

[convnet-benchmarks]
./convnet-benchmarks/chainer/run.sh

Result:
K2200: AlexNet 639ms, OverFeat 2195ms
M4000: AlexNet 315ms, OverFeat 1142ms

I think that the GPU clock is what matters in the simple network.

Is this correct?

GPU clock:
K2200: 1045 MHz
M4000: 772 MHz

Shading units:
K2200: 640
M4000: 1664

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/?replytocom=8827#respond)

Kristofer says
2016-10-01 at 18:21 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-8767)

Hi! Great article, very informative. However, I want to point out that the NVIDIA GeForce GTX Titan X and the NVIDIA Titan X are two different graphics cards (yes, the naming is a little bit confusing). The GeForce GTX Titan X has the Maxwell microarchitecture, while the Titan X has the newer Pascal microarchitecture. Hence there is no “GTX Titan X Pascal”.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8767#respond)

Tim Dettmers says

2016-10-03 at 15:01 (http://timdettmers.com/2017/04/09/which-gpu-for-


deep-learning/#comment-8814)

Ah, this is actually true; I did not realize that! Thanks for pointing it out! It is probably still best to add Maxwell/Pascal so as not to confuse people, but I should remove the GTX part.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8814#respond)


chanhyuk jung says


2016-10-09 at 17:47 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-8956)

I live in a place where 200kWh costs 19.92 dollars and 600kWh costs 194.70 dollars; the electricity bill grows exponentially. I usually train unsupervised learning algorithms on 8 terabytes of video. Which GPU or GPUs should I get? The Titan X Pascal has the most bandwidth per watt, but it is a lot more expensive for the small gain in performance per watt.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=8956#respond)

Nikos Tsarmpopoulos says


2017-05-19 at 13:00 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15104)

Take into consideration the potential cost of electricity when comparing the options of building your own machine versus renting one in a data centre.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=15104#respond)

andrey kim says


2016-10-29 at 21:13 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-9332)

Great article. I am just a noob at this and still learning; not a researcher, but an applications guy. I have an old 2008 Mac Pro with 32GB of RAM (FB-DIMMs in 8 channels) and dual quad-core Xeons at 2.8GHz (8 cores, 8 threads). I've been using a GTX 750 Ti with 4GB for DeepMask/SharpMask on Torch. The COCO image set took 5 days to train through 300 epochs on DeepMask. I am wondering how much of a performance increase I would see going to a GTX 1070? Or could I instead add a second GTX 750 Ti matching the one I have, for 8GB of RAM? (I have room for 2 GPUs.)

Thanks for everything.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=9332#respond)

Tim Dettmers says


2016-11-07 at 11:23 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-9508)

Adding a GTX 750 Ti will not increase your overall memory, since you will need to make use of data parallelism, where the same model rests on all GPUs (the model is not distributed among the GPUs, so you will see no memory savings). In terms of speed, an upgrade to a GTX 1070 should be better than two GTX 750 Tis and also significantly easier in terms of programming (no multi-GPU programming needed). So I would go with the GTX 1070.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=9508#respond)

navdeep singh says


2016-11-02 at 20:25 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-9414)

Hi Tim, thanks for an insightful article!

I picked up a new 13″ MacBook with Thunderbolt 3 ports, and I am thinking of a setup using a GTX 1080 in an eGFX enclosure:
http://www.anandtech.com/show/10783/powercolor-announces-devil-box-thunderbolt-3-external-gpu-enclosure (http://www.anandtech.com/show/10783/powercolor-announces-devil-box-thunderbolt-3-external-gpu-enclosure). What do you think of this idea?

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=9414#respond)

Tim Dettmers says


2016-11-07 at 11:29 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-9511)

It should work okay. There might be some performance problems when you transfer data from CPU to GPU. For most cases this should not be a problem, but if your software does not buffer data on the GPU (sending the next mini-batch while the current mini-batch is being processed), then there might be quite a performance hit. However, this performance hit is due to software, not hardware, so you should be able to write some code to fix the performance issues. In general, performance should be good in most cases, at around 90%.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=9511#respond)

Carlo says
2016-11-09 at 14:49 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-9565)


Hi,
I want to test multiple neural networks against each other using Encog. For that I want to get an Nvidia card. After reading your article I am thinking about getting the 1060, but since most calculations in Encog use double precision, would the 780 Ti be a better fit? The data file will not be large, and I do not use images.
Thanks

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=9565#respond)

Tim Dettmers says


2016-11-10 at 23:23 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-9589)

The GTX 780 Ti would still be slow for double precision. Try to get a GTX Titan (regular) or a GTX Titan Black; they have excellent double precision performance and generally work quite okay even in 32-bit mode.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=9589#respond)

aws training (http://www.visualpath.in/amazon-web-services-aws-training.html) says

2016-11-28 at 12:16 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-10048)

Thanks for sharing this good stuff! Keep up the great work; we look forward to reading more from you in the future!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=10048#respond)

James says
2016-11-28 at 13:16 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-10049)

Hi Tim,

Really useful post, thanks.

I wondered about the Titan Black: looking online, its memory bandwidth, 6GB of memory, and single and double precision are better than a 1060's, and at current eBay prices it is about 10-15% cheaper than a 1060.

Other than the lower power draw of the 1060 and the warranty, would there be any reason to choose the 1060 over a Titan Black?

Thank you

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=10049#respond)

Tim Dettmers says


2016-11-29 at 15:58 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-10078)


The architecture of the GTX 1060 is more efficient than the Titan Black's. Thus, for speed, the GTX 1060 should still be faster, but probably not by much. So the GTX Titan Black is a solid choice, especially if you also want to use its double precision.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=10078#respond)

Markus says
2016-12-06 at 03:08 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-10245)

Hi Tim,
thanks for the article; it's the most useful one I found during my 14-hour Google marathon!
I'm very new to deep learning. Starting with YOLO, I've found that my GTX 670 with 2GB is seriously limiting what I can explore. Inferring from your article: if I stack multiple GPUs for CNNs, the memory will in principle add up, right? I'm asking because I will decide between a used Maxwell Titan X and a 1070/1080, and my main concern is memory, so I would like to know whether adding a second card at some point, when they are cheaper, is a reasonable memory upgrade option (for CNNs). Furthermore, if the 1080 and the used Maxwell Titan X are the same price, is that a good deal?

Also, I'm concerned about the FP16x2 feature of the 1070/1080, which adds only one FP16x2 core for every 128 FP32 cores: if I'm using FP16, the driver might report my card as FP16v2-capable, and thus a framework might use these few FP16v2 cores instead of emulating FP16 arithmetic by promoting to FP32. Is this a valid worst-case scenario for e.g. Caffe/Torch/..., or am I confusing something here? Also, I've read that before Pascal there is effectively no storage benefit from FP16, as the numbers need to be promoted to FP32 anyway. I can only understand this if the data needs to be promoted before fetching it into the registers for computation; is this right?

Thank you

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=10245#respond)

Tim Dettmers says


2016-12-13 at 12:22 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-10451)

Hi Markus,

unfortunately, the memory will not stack up, since you will probably use data parallelism to parallelize your models (the only form of parallelism which really works well and is fast).

If you can get a used Maxwell Titan X cheaply, that is a solid choice. I personally would not mind the minor slowdown given the added flexibility, so I would go for the Titan X here as well.

Currently you do not need to worry about FP16. Current code will make use of FP16 memory but FP32 computations, so the slow FP16 compute units on the GTX 10 series will not come into play. All of this probably only becomes relevant with the next Pascal generation, or even only with Volta.

I hope this answered all of your questions. Let me know if things are still unclear.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=10451#respond)

ywteh (http://www.stats.ox.ac.uk/~teh) says


2016-12-13 at 16:12 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-10458)

Hi Tim, thanks for a great article! I'm just wondering if you have experience with installing GTX or Titan X cards in rackmount servers? Or do you have recommendations for articles or providers on the web? (I'm in the UK.) I am having a long-running discussion with IT support about whether it is possible, as we couldn't find any big providers that would put together such a system. The main issue seems to revolve around cooling, as IT says that Teslas are passively cooled while the Titan X is actively cooled and may interfere with the server's cooling system.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=10458#respond)

Tim Dettmers says

2016-12-13 at 22:04 (http://timdettmers.com/2017/04/09/which-gpu-for-


deep-learning/#comment-10470)

I think the passively cooled Teslas still have a 2-PCIe width, so that should not be a problem. If the current Tesla cards are 1-PCIe width, then it will be a problem and Titan Xs will not be an option.

Cooling might indeed also be an issue. If the passively cooled Teslas have intricate cooling fins, then their cooling combined with active server cooling might indeed be much superior to what Titan Xs can offer. Cooling systems for clusters can be quite complicated, and this might lead to Titan Xs breaking the system.

Another issue might be just buying Titan Xs in bulk. NVIDIA does not sell them in bulk, so you will only be able to equip a small cluster with these cards (this is also the reason why you do not find any providers for such a system).

Hope this helps!

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=10470#respond)

Neo says
2016-12-15 at 11:05 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-10541)

Hi Tim, I found an interesting thing recently.

I tried one Keras project (both Theano and TensorFlow backends were tested) on three different computing platforms:
A: SSD + i5-3470 (3.2GHz) + GTX 750 Ti (2GB)
B: SSD + E5-2620 v3 + Titan X (12GB)
C: HDD + i5-6300HQ (2.6GHz) + GTX 965M (4GB)

With the same setup of CUDA 8.0 and cuDNN 5.0, A and B got similar GPU performance. However, I cannot understand why C is about 5 times slower than A. I guessed C would perform better than A before the experiment.

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=10541#respond)

Tim Dettmers says


2016-12-16 at 16:47 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-10564)

As I understand it, Keras might not prefetch data. On certain problems this can introduce some latency when you load data, and loading data from a hard disk is slower than from an SSD. If the data is loaded into memory by your code, however, this is unlikely to be the problem. What strikes me is that A and B should not be equally fast. C could also be slow due to the laptop motherboard, which may have a poor or reduced PCIe connection, but usually this should not be such a big problem. Of course, this could still happen for certain datasets.
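If data loading does turn out to be the bottleneck, a generic workaround is to prefetch batches on a background thread so that disk I/O overlaps with GPU compute. A minimal sketch in plain Python (load_batch is a hypothetical function that reads one batch from disk):

    import queue
    import threading

    def prefetching_batches(load_batch, n_batches, buffer_size=4):
        # Load batches on a background thread; the buffer hides disk latency.
        buf = queue.Queue(maxsize=buffer_size)

        def producer():
            for i in range(n_batches):
                buf.put(load_batch(i))  # blocks while the buffer is full
            buf.put(None)               # sentinel: no more batches

        threading.Thread(target=producer).start()
        while True:
            batch = buf.get()
            if batch is None:
                return
            yield batch

    # usage: for x, y in prefetching_batches(load_batch, 1000):
    #            model.train_on_batch(x, y)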


Nader says
2016-12-25 at 00:29 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-10910)

How about GTX 1070 SLI?


An Tran (http://antran89.github.io/) says


2017-01-03 at 16:25 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-11200)

“However, training may also take longer, especially the last stages of training where it becomes more and more important to have accurate gradients.” WHY are the last stages of training so important? Any justification?


Tim Dettmers says


2017-01-03 at 21:34 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-11214)

It is easy to improve from a pretty bad solution to an okay solution, but it is very difficult to improve from a good solution to a very good solution. Improving your 100-meter dash time by a second is probably not so difficult, while for an Olympic athlete it is sheer impossible, because they already operate at a very high level. The same goes for neural networks and their solution accuracy: late in training the remaining gains are small, so noisy gradients can easily wash them out, which is why accurate gradients matter most in the last stages.


Nader says
2017-01-03 at 23:32 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-11218)

Does an AMD card support Theano / Keras?


Maciej Wieczorek says


2017-01-29 at 03:16 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-11908)

Amazon has introduced a new class of instances: Accelerated Computing Instances (P2), with 12GB K80 GPUs. These are much, much better than the older G2 instances, and go for $0.90/hr. Does this change anything in your analysis?


Tim Dettmers says


2017-01-29 at 22:55 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-11924)


With such an instance you get one K80 for $0.90/h, which means $21.60/day and $648/month. If you use your GPU for more than one GPU-month of runtime, then it quickly becomes cheaper to buy your own GPU. I do not think it really makes sense for most people.

It is probably a good option for people doing Kaggle competitions, since most of the time will still be spent on feature engineering and ensembling. For researchers, startups, and people who are learning deep learning, it is probably still more attractive to buy a GPU.
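For reference, the break-even arithmetic in a few lines of Python (the $700 purchase price is an assumed figure for a GTX 1080 Ti class card, not part of the original comparison):

    HOURLY_RATE = 0.90   # AWS P2 price per hour, as quoted above
    GPU_PRICE = 700.0    # assumed price of a comparable consumer GPU

    print("per day:   $%.2f" % (HOURLY_RATE * 24))        # $21.60
    print("per month: $%.2f" % (HOURLY_RATE * 24 * 30))   # $648.00
    print("break-even: %.0f hours" % (GPU_PRICE / HOURLY_RATE))  # ~778 hours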


Hesam M says
2017-02-02 at 16:01 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-12005)

Hello,

I decided to buy a GTX 1060 or GTX 1070 card to try deep learning, but I am curious whether the RAM size of the GPU or its bandwidth/speed will affect the ACCURACY of the final model, comparing these two specific GPU cards.

In other words, I want to know whether selecting the GTX 1060 will just cause longer training times than the GTX 1070, or whether it will affect the accuracy of the model as well.


Tim Dettmers says


2017-02-10 at 17:40 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-12213)

Hi Hesam, the two cards will yield the same accuracy. There are some elements in the GPU which are non-deterministic for some operations, so the results will not be exactly the same, but they will always be of similar accuracy.


Joe says
2017-02-02 at 22:59 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-12009)

Hi, it’s been a pleasure to read this article! Thanks!

Have you done any comparison of 2 x Titan X against 4 x GTX 1080? Or maybe you
have some thoughts regarding it?



Tim Dettmers says


2017-02-10 at 17:44 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-12214)

The speed of 4x GTX 1080 vs. 2x Titan X is difficult to measure, because parallelism is still not well supported by most frameworks and the speedups are often poor. If you look at all the GPUs separately, however, it depends on how much memory your task needs. If 8GB is enough, 4x GTX 1080 are definitely better than 2x Titan X; if not, then 2x Titan X are better.


Erik (https://nomansmind.com) says


2017-02-11 at 12:37 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-12238)

Hi Tim,
rst of all thank you for this great article.

I understand that the memory clock speed is quite important, and depending on which graphics card manufacturer/line I choose, there will be up to a 10% difference. Here is a good overview [German]:

http://www.pcgameshardware.de/Pascal-Codename-265448/Specials/Geforce-GTX-1080-Custom-Designs-Herstellerkarten-1198846/

I am going to buy a 1080 and I am wondering if it makes sense to get such an OC (overclocked) one. Do you have any experience with / advice on this?

Thank you for an answer,


Erik


Tim Dettmers says


2017-02-14 at 19:29 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-12331)

OC GPUs are good for gaming, but they hardly make a difference for deep learning. You are better off buying a GPU with other features, such as better cooling. When I tested overclocking on my GPUs it was difficult to measure any improvement. Maybe you will get something in the range of 1-3% improved performance for OC GPUs, so it is not worth much if you need to pay extra for OC.


Ink says
2017-02-26 at 18:20 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-12630)

Could you add AWS’s new P2 instance into the comparison? Thank you very much!


Tim Dettmers says


2017-02-27 at 00:04 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-12636)

The last time I checked, the new GPU instances were not viable due to their pricing. Only in some limited scenarios, where you need deep learning hardware for a very short time, do AWS GPU instances make economic sense. Often it is better to buy a GPU, even if it is a cheaper, slower one. With that you will get many more GPU-accelerated hours for your money compared to AWS instances. If money is less of an issue, AWS instances also make sense for startups to fire up some temporary compute power for a few experiments or for training a new model.


Eric PB says

2017-03-01 at 21:26 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-12727)

Hi Tim,

With the release of the GTX 1080 Ti and the revamp+reprice of the GTX 1060/70/80, would you change anything in your TL;DR section, especially vs. the Pascal Titan X?

Links to key points:


– GTX 1080 Ti: http://wccftech.com/nvidia-geforce-gtx-1080-ti-unleash-699-usd/
– Revamp+reprice of GTX 1060/70/80 etc.: http://wccftech.com/nvidia-geforce-gtx-1080-1070-1060-official-price-cut-specs-upgrade/

Cheers,

E.


Tim Dettmers says


2017-03-06 at 10:57 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-12908)

Thank you so much for the links. I will have to look at those details, make up my mind, and update the blog post. At first glance it seems that the GTX 1070 8GB would really be the way to go for most people. Just a lot of bang for the buck.

The NVIDIA Titan X seems to become obsolete for 95% of people (only vision researchers who need to squeeze out every last bit of RAM should use it), and the GTX 1080 Ti will be the way to go if you want fast compute.



Thomas Rupp says


2017-03-02 at 22:49 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-12754)

Hi Tim,
Thank you for the great article and answering our questions.

NVIDIA just announced their new GTX 1080 Ti. I heard that it shall even outperform the Titan X Pascal in gaming. I have not read anything about the performance of the GTX 1080 Ti in machine learning / deep learning yet.

I am building a PC at the moment and have some parts already. Since the Titan X
was not available over the last few weeks, I could still get the GTX 1080 TI instead.

1.) What is better, the GTX 1080 Ti or the Titan X? If the difference is very small, I would choose the cheaper 1080 Ti and upgrade to Volta in a year or so. Is the only difference the 11GB instead of 12GB and a slightly faster clock, or are some features disabled that could cause problems with deep learning?

2.) Is half precision available on the GTX 1080 Ti and/or the Titan X? I thought it was only available on the much more expensive Tesla cards, but after reading through the replies here, I am not sure anymore. To be more precise, I only care about half precision (float16) if it brings a considerable speed improvement (on Tesla cards roughly twice as fast compared to float32). If it is available but at the same speed as float32, I obviously do not need it.

Looking forward to your reply.


Thomas


Thomas Rupp says


2017-03-04 at 08:43 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-12811)

Hi Tim,

one more question: How much of an issue will the 11GB of the GTX 1080 Ti be compared to the 12GB on the Titan X? Does that mean that I cannot run many of the published models that were created by people on a 12GB GPU? Do people usually fill up all of the available memory by creating deep nets that just fit into their GPU memory?


Tim Dettmers says

2017-03-06 at 11:13 (http://timdettmers.com/2017/04/09/which-gpu-


for-deep-learning/#comment-12911)

Some of the very state-of-the-art models might not run on some of the datasets. But in general, this is a non-issue. You will still be able to run the same models, but instead of 1000 layers you will only have something like 900 layers. If you are not someone who does cutting-edge computer vision research, then you should be fine with the GTX 1080 Ti.

Alternatively, you can always run these models in 16-bit on most frameworks just fine. Most models are defined with 32-bit parameters, so this requires a bit of extra work to convert an existing model to 16-bit (usually a few lines of code), but most models should run. I think you will be able to run more than 99% of the state-of-the-art models in deep learning, and about 90% of the state-of-the-art models in computer vision. So definitely go for a GTX 1080 Ti if you can wait that long.
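As a rough illustration of those “few lines of code”: in Keras, for example, you can switch the default float type before building the model. A minimal sketch, assuming your Keras backend supports float16 storage (the layer sizes here are arbitrary):

    from keras import backend as K
    from keras.models import Sequential
    from keras.layers import Dense

    # Store weights and activations in 16-bit to roughly halve memory use.
    K.set_floatx('float16')

    # Any model built from here on uses float16 parameters by default.
    model = Sequential()
    model.add(Dense(512, activation='relu', input_dim=784))
    model.add(Dense(10, activation='softmax'))
    model.compile(optimizer='sgd', loss='categorical_crossentropy')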


Tim Dettmers says


2017-03-06 at 11:03 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-12909)

You are welcome. I am always happy if my blog posts are useful!

1.) I would definitely go with the GTX 1080 Ti due to price/performance. The extra memory on the Titan X is only useful in very few cases. However, beware: it might take some time between announcement, release, and when the GTX 1080 Ti is finally delivered to your doorstep, so make sure you have that spare time. Also make sure you preorder it; when new GPUs are released, their supply usually sells out within a week or less. You do not want to wait until the next batch is produced.

2.) Half precision is implemented on the software layer, but not on the hardware layer, for these cards. This means you can use 16-bit storage, but the software libraries will upcast it to 24-bit to do the computation (which is equivalent to 32-bit computational speed). So you can benefit from the reduced memory size, but not yet from the increased computation speed of 16-bit arithmetic. You only see true 16-bit speed on the P100, which nobody can afford, and probably you will only see it for consumer cards in the Volta series, which will be released next year.


Ark Aung (https://github.com/arkaung) says


2017-03-09 at 23:28 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13074)

Now that the GTX 1080 Ti is out, would you recommend it over the Titan X?


Anonymous says
2017-03-13 at 21:17 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13259)

Considering the incoming refresh of the GeForce 10 series, should I purchase an entry-level 1060 6GB now, or will there be something more interesting in the near future?


Tim Dettmers says


2017-03-19 at 18:03 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13473)

The GTX 1060 will remain a solid choice. I would pick the GTX 1060 if I were
you.


Chris says
2017-03-19 at 18:16 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13475)

Thanks for the brilliant summary! I wish I had read this before the purchase of my 1080; I would have bought a 1070 instead, as it seems the better option for value for the kind of NLP tasks I have at hand.



Stewart Harding (http://xposure4all.com) says


2017-03-20 at 00:18 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13489)

Great blog, my friend!


Chan says
2017-03-23 at 08:56 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13598)

Hi Tim,

There are a number of brands for the GTX 1080 Ti, such as Asus, MSI, Palit, Zotac etc. May I know whether the brand matters? I am planning to get a GTX 1080 Ti for my deep learning research, but I am not sure which brand to get.

Thank you.



Tim Dettmers says


2017-03-23 at 18:22 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13607)

In terms of deep learning performance, the GPUs themselves are more or less the same (overclocking etc. does not really do anything); however, the cards sometimes come with different coolers (most often it is the reference cooler), and some brands have better coolers than others. The choice of brand should be made first and foremost on the cooler, and if they are all the same, the choice should be made on price. So: the GPUs are the same; focus on the cooler first, price second.


foojpg says
2017-03-23 at 09:05 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13599)

Hey Tim,

I’m in a confused state of mind right now. I currently have a GTX 960 4GB, which I’m selling. After that I’ll have enough cash to buy either a GTX 980 4GB or a GTX 1060 6GB. I plan to get serious with DL. I know most are saying the 1060, but it doesn’t have SLI! What if I want to upgrade in 5-6 months (just in case I suddenly get extremely serious)? Please help me.

Thanks



Tim Dettmers says


2017-03-23 at 18:25 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13609)

SLI is used for gaming only; you do not need it for parallelization (for CUDA computing, the direct connection via PCIe is used). So if you want to get two GTX 1060s you can still parallelize them; no SLI bridge required.
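To illustrate that no SLI bridge is involved, here is a minimal TensorFlow 1.x sketch that places independent work on each card (it assumes two GPUs are visible to CUDA):

    import tensorflow as tf

    # Each tf.device block pins its ops to one physical GPU over PCIe.
    with tf.device('/gpu:0'):
        a = tf.random_normal([1000, 1000])
        prod0 = tf.matmul(a, a)
    with tf.device('/gpu:1'):
        b = tf.random_normal([1000, 1000])
        prod1 = tf.matmul(b, b)

    with tf.Session() as sess:
        sess.run([prod0, prod1])  # both GPUs compute independently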


Haris Jabbar says


2017-03-23 at 15:37 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13605)

I was waiting for this very update, i.e. your recommendation after GTX1080Ti
launch. As always, a very well rounded analysis.

I am building a two GPU system for the sole purpose of Deep Learning research and
have put together the resources for two 1080Tis
(https://de.pcpartpicker.com/list/GMyvhq). First I want to thank you for your earlier posts, because I used your advice in selecting every single component of this setup.

Secondly, there is one aspect you haven’t touched on, and I was wondering if you had any pointers about it. It’s about cooling and its effect on higher FLOPS. Having settled on a dual 1080 Ti system, I now have to choose among the stock cooling of the FE, elaborate air cooling from AIBs, or custom liquid cooling. From what I understand, FLOPS are directly proportional to GPU frequency, so cooling the system to run the GPUs at a higher clock rate should theoretically give a linear increase in performance.

1) In your experience, is this linear increase in performance seen in practice too? On a related note, since you emphasized memory bandwidth so much, is it the case that all/most DL setups are memory-bound? Because in that case, increasing compute performance won’t be of any use.

2) In earlier posts you recommended going with the RAM with the slowest frequency, as it will not be a bottleneck. From your argument in that post, I understand that 2133MHz RAM would be good enough.

Thank you again for your time and these posts.


Tim Dettmers says

2017-03-23 at 18:35 (http://timdettmers.com/2017/04/09/which-gpu-for-


deep-learning/#comment-13610)

1) Bad cooling can reduce performance significantly. However, (1) case cooling does hardly anything for the GPUs, and (2) the GPUs will be sufficiently cooled if you use air cooling and crank up the fans. I never tried water cooling, but it should increase performance compared to air cooling under high loads, when the GPUs overheat despite maxed-out fans. This should only occur if you run them for many hours in an unventilated room. If that is the case, then water cooling may make sense. I have no hard numbers on the performance gain, but in terms of hardware, cooling is by far the biggest source of performance gains (I think you can expect 0-15%). So if you are willing to put in the extra work and money for water cooling, and you will run your GPUs a lot, then it might be a good fit for you.

2) 2133MHz will be fine. Theoretically, the performance loss should be almost unnoticeable, probably in the 0-0.5% range compared to the highest-end RAM. I personally run 1600MHz RAM and, compared to other systems that I have run on, I could not detect any degradation in performance.


Haris says
2017-03-23 at 18:43 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-13611)

Thank you for the prompt reply. I think I will stick to air cooling for now and keep water cooling for a later upgrade.



Grey says
2017-03-23 at 22:07 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13614)

Hi Tim, thanks for the informative post. I am currently looking at the 1080 Ti. Right now I am running an LSTM(24) with 24 time steps. I generally use Theano and TensorFlow. What kind of speed increase would you expect from buying one 1080 Ti as opposed to two 1080 Ti cards? Just trying to figure out if it’s worth it. Should I buy an SLI bridge as well; does that factor in? Thanks, really enjoyed reading your blog.


Tim Dettmers says


2017-03-26 at 14:00 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13682)

LSTMs scale quite well in terms of parallelism. The longer your timesteps, the better the scaling. Theano and TensorFlow have in general quite poor parallelism support, but if you make it work you can expect a speedup of about 1.5x to 1.7x with two GTX 1080 Tis compared to one; if you use frameworks which have better parallelism capabilities, like PyTorch, you can expect 1.6x to 1.8x; if you use CNTK, you can expect speedups of about 1.9x to 1.95x.

These numbers might be lower for 24 timesteps. I have no hard numbers on when good scaling begins in terms of parallelism, but it is already difficult to utilize a big GPU fully with 24 timesteps. If you have tasks with 100-200 timesteps, I think the above numbers are quite accurate.
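For reference, data parallelism in PyTorch amounts to a single wrapper call. A minimal sketch, assuming two visible GPUs (the LSTM sizes and batch shape are arbitrary):

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    # batch_first=True so DataParallel can split the batch dimension.
    lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
                   batch_first=True)
    model = nn.DataParallel(lstm).cuda()  # replicates the net, splits batches

    x = Variable(torch.randn(64, 24, 128)).cuda()  # (batch, steps, features)
    output, _ = model(x)  # each GPU processes half the batch in parallel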


Bill Simmons (http://www.spoatech.com) says


2017-03-24 at 17:23 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13633)

Now that the GTX 1080 Ti is based on Pascal, what would be the difference in using that card versus the Titan X Pascal for DNN training, whether it be for vision, speech, or most other complex networks?


Tim Dettmers says


2017-03-26 at 13:55 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13681)

The performance is pretty much equal; the only difference is that the GTX 1080 Ti has only 11GB, which means some networks that run on a Titan X Pascal might not be trainable on it.


Somebody Else's Lover says


2017-03-26 at 21:57 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13688)

After the release of the 1080 Ti, you seem to have dropped your recommendation of the 1080. You only recommend the 1080 Ti or 1070, but why not the 1080? What is wrong with it? It seems to have significantly better performance than the 1070, so why not recommend the 1080 as a budget but performant GPU? Is it really a waste of money to buy it, and if so, why?

Thanks


Tim Dettmers says


2017-03-27 at 10:03 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13699)

Thank you, that is a valid point. I think the GTX 1080 does not have a good performance/cost ratio and has no real niche to fill. The GTX 1070 offers good performance, is cheap, and provides a good amount of memory for its price; the GTX 1080 provides a bit more performance, but not more memory, and is quite a step up in price; the GTX 1080 Ti, on the other hand, offers even better performance, 11GB of memory which is suitable for a card of that price and performance (enabling most state-of-the-art models), and all that at a better price than the GTX Titan X Pascal.

If you need the performance, you often also need the memory. If you do not need the memory, this often means you are not at the edge of model performance, and thus you can wait a bit longer for your models to train, as these models often do not need to train for that long anyway. Along the way, you also save a good chunk of money. I just do not see a very solid use-case for the GTX 1080 other than “I do not want to run state-of-the-art models, but I have some extra money, though not enough for a GTX 1080 Ti, and I want to run my models faster.” This is a valid use-case, and I would recommend the GTX 1080 for such a situation. But note that this situation is rare. For most people either the GTX 1070 is already expensive, or the GTX 1080 Ti is cheap, and there are few people in between.

Also note that you can get almost two GTX 1070s rather than one GTX 1080. I would recommend two GTX 1070s over one GTX 1080 any day.


Somebody Else's Lover says


2017-03-27 at 21:59 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-13711)

Thanks for your detailed reply, but the GTX 1080 price dropped rapidly after the release of the 1080 Ti, and its price gap with the 1070 has narrowed significantly. To tell the truth, I purchased an MSI Armor 1080 at a cheaper price than an MSI Gaming X 1070, thanks to a weekend discount, but after you deliberately disregarded the 1080 I had started doubting my choice, which you have now cleared up.

Thanks again,
greetings from Turkey.


Tim Dettmers says


2017-03-27 at 23:16 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-13713)

I did not know that the price dropped so sharply. I guess this means that the GTX 1080 might not be such a bad choice after all. Thanks for the info!


Eric Perbos says

2017-03-27 at 22:30 (http://timdettmers.com/2017/04/09/which-gpu-for-


deep-learning/#comment-13712)

Tim Dettmers’s response is very logical, as it is based on MSRP (Manufacturer Suggested Retail Price), which shows little benefit in paying 30-40% additional dollars for a current GTX 1080 instead of a GTX 1070, as both have 8GB of memory, which can often be a key factor in optimizing your ML/DL sessions. The memory size (4GB for a 1050 Ti, 6GB for a 1060, and 8GB for both the 1070/1080) is probably the most important factor, ahead of bandwidth performance (check Nvidia for a proper comparison).

Now the truth is that many retailers are indeed discounting the current GTX 1080 cards in stock heavily (anticipating the “new” 1080 version with better memory/bandwidth performance), so that the price spread between the 1070 and 1080 is nowhere near the 30-40% official MSRP. I’ve seen price differences in Europe between the current 1070 and 1080 limited to 10-15% max due to heavy discounts.

So if you can get a current 1080 for, say, USD 450 vs. a 1070 for USD 400: for sure get the 1080 if you can afford it.

Hope this helps.


Tim Dettmers says


2017-03-27 at 23:17 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-13714)

That makes a lot of sense. Thanks for your comment!


Amir H. Jadidinejad says


2017-03-27 at 10:00 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13698)

Thank you for sharing. This thread is very helpful.

The 1080 Ti is out of stock in the NVIDIA store now. Do you know when it will be in stock again?


Tim Dettmers says


2017-03-27 at 23:18 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13715)

This happened with some other cards too when they were freshly released. For some other cards, the waiting time was about 1-2 months, I believe. I do not know if this is indicative of the GTX 1080 Ti, but since no further information is available, this is probably what one can expect.



Eric Perbos says


2017-03-28 at 21:17 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13736)

Heja Amir,

The online offerings for the GTX 1080 Ti are getting wild now, as the first batch, the Founders Edition (FE), from the usual suspects (Asus, EVGA, Gigabyte, MSI, Zotac) came into retail on March 10.

Now the second batch, custom versions with dedicated cooling and sometimes
overclocking from the same usual suspects, are coming into retail at a similar
price range.

As a result, not only will you see plenty of inventory available in both FE and custom versions, but you will also see some nice “refurbished” bargains, as early adopters of the FE send back their cards to upgrade to a custom version, (ab)using the right to return online purchases within 7-15 days.

For example, I just snatched a refurbished Asus GTX 1080 Ti FE from our local “NewEgg” in Sweden for SEK 7690 instead of the SEK 8490 local official price. A nice 10% discount, easy to grab; the card still had the plastic films on, so the previous owner was obviously planning his (ab)use of consumer rights.

Hope this helps.



Amir H. Jadidinejad says


2017-03-28 at 22:22 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-13739)

Dear Eric,
Thank you. Currently, I can preorder custom versions from EVGA (http://www.evga.com/products/productlist.aspx?type=0&family=GeForce+10+Series+Family&chipset=GTX+1080+Ti). Do you suggest these custom versions (for example: http://www.evga.com/products/product.aspx?pn=11G-P4-6393-KR) for deep learning research, or do you prefer the Founders Edition? I’m worried that overclocking may not be appropriate for deep learning research when you run a program for a long time.


Haider says
2017-03-28 at 00:44 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13718)

Hi Tim,

If I have a system with one 1080 Ti GPU, will I get 2x the performance if I add another one?

The CPU is a Core i7-3770K, and its maximum possible RAM is only 32GB; also, if I add another GPU, the PCIe 3.0 lanes will drop from 16 to 8 lanes for each.

I will use them for image recognition, and I am planning to run other attempts with different configurations on the 2nd GPU while waiting for the training on the 1st GPU. I am kind of new to DL and afraid that it is not so easy to run one network on 2 GPUs, so probably training one network on one GPU and training another on the 2nd will be my easiest way to use them.

My concern is about the RAM: will it be enough for 2 GPUs? And is the CPU fast enough to deal with 2 ConvNets on 2 GPUs?

Will my system be the bottleneck here in a two-GPU configuration, making it not worth the money to buy another 1080 Ti?

And what if I buy a lower-performance GPU alongside the 1080 Ti, like the GTX 1080? Any problem with that?

Thank you for this unique blog. There is a lot of software advice out there for DL, but on hardware I can barely find anything like yours.
Many thanks


Tim Dettmers says


2017-03-29 at 13:41 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13754)


The performance depends on the software. For most libraries you can expect a speedup of about 1.6x, but along with that comes additional multi-GPU code that you need to write; with some practice this should become second nature quickly. Do not be afraid of multi-GPU code.

32GB is more than fine for two GPUs. I am working quite comfortably with 24GB and two GPUs. 8 lanes per GPU can be a problem when you parallelize across GPUs, and you can expect a performance drop of roughly 10%. If the CPU and RAM are cheap then this is a good trade-off.

The CPU will be alright; you will not see any performance drop due to the CPU.

If you are just getting started, I would recommend two GTX 1070s instead of the expensive, big GTX 1080 Ti. If you can find a cheap GTX 1080 this might also be worth it, but a GTX 1070 should be more than enough if you are just starting out in deep learning.
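For the simpler plan of one experiment per GPU, you do not even need multi-GPU code: just pin each training process to one card before the framework initializes CUDA. A minimal sketch (the training script itself is hypothetical):

    import os

    # Process 1 sets '0', process 2 sets '1'; each then sees only one GPU.
    # This must happen before importing theano/tensorflow/keras.
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'

    import keras  # the framework now uses only the selected GPU
    # ...run your training code here; launch the script twice, once per GPU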


Thomas says
2017-03-28 at 20:14 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13734)

I am looking to get into deep learning more after taking the Udacity Machine Learning Nanodegree. This will likely not be a professional pursuit, at least for a while, but I am very interested in integrating multiple image-analysis neural networks, speech/text processing and generation, and possibly using physical simulations for training. Would you recommend 2 GPUs, one to run the deep learning nets and one to run the simulation? Is that even possible with things like OpenAI’s Universe? Also, do you see much reason to buy aftermarket overclocked or custom cooler designs with regard to their performance for deep learning? I greatly appreciate this blog and any insight you might have as I look to update my old rig for new pursuits.


Tim Dettmers says


2017-03-29 at 13:45 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13755)

What kind of physical simulations are you planning to run? If you want to run fluid or mechanical models, then normal GPUs could be a bit problematic due to their bad double-precision performance. If you need double precision for your simulation, I would go with an old GTX Titan (Kepler, 6GB) from eBay for the double precision, and a current GPU for deep learning. If you run simulations that do not require double precision, then a current GPU (or two if you prefer) is best.

Overclocked GPUs do not improve performance in deep learning. Custom cooler designs can improve performance quite a bit, and this is often a good investment. However, you should check benchmarks to see whether the custom design is actually better than the standard fan-and-cooler combo.



Thomas says
2017-03-29 at 18:31 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-13763)

The simulations, at least at first, would be focused on robot or human modeling to allow a neural network more efficient and cost-effective practice before moving to an actual system, but I can broach that topic more deeply when I get a little more experience under my belt.
My most immediate interest is whether I should look at investing in a single aftermarket 1080 Ti (with the option to add another later on) or something closer to 2x 1070s when working with video/language processing (perception, segmentation, localization, text extraction, geometric modeling, language processing, and generative response)?
Also, looking into the NVIDIA Drive PX system, they mention 3 different networks running to accomplish various perception tasks; can separate networks be run on a single GPU with the proper architecture?


Tim Dettmers says


2017-03-31 at 15:39 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-13812)

Yes, you can train and run multiple models at the same time on one GPU, but this might be slower if the networks are big (you do not lose performance if the networks are small), and remember that memory is limited. I think I would go with a GTX 1070 first and explore your tasks from there. You can always get a GTX 1080 Ti or another GTX 1070 later. If your simulations require double precision, then you could still put your money into a regular GTX Titan. I think this is the more flexible and smarter choice. The things you are talking about are conceptually difficult, so I think you will be bound by programming work and thinking about the problems rather than by computation, at least at first. So this would be another reason to start with little steps, that is, with one GTX 1070.
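One practical note when sharing a single GPU between two training processes: TensorFlow, for example, grabs all GPU memory by default, so each process should be capped. A minimal TensorFlow 1.x sketch (the 0.45 fraction is an arbitrary assumed split):

    import tensorflow as tf

    # Cap this process at ~45% of GPU memory so a second training
    # process can share the same card without running out of memory.
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.45)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))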


ravi jagannathan says


2017-03-28 at 21:22 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13738)

This is a very useful post. Is there an assumption in the above tests that the OS is Linux, i.e., that the deep learning package runs on Linux, or does it not matter? I am considering a new machine, which means a sizeable investment.

It is easy to buy a Windows gaming machine with the GPU installed off the shelf, whereas there are few vendors for Linux desktops. And there is the side benefit of using the machine for gaming too.



Tim Dettmers says


2017-03-29 at 13:34 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13753)

GPU performance is OS-independent, since the OS barely interacts with the GPU. A gaming machine with preinstalled Windows is fine, but you will probably want to install Linux alongside Windows so that you can work more easily with deep learning software. If you have just one disk this can be a bit of a hassle due to bootloader problems, so for that I would recommend getting two separate disks and installing an OS on each.


Phillip Glau says


2017-03-30 at 05:23 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13770)

Great followup post to your 2015 article.

One thing I would add is that the cooling system of various cards makes a difference if you’re going to stack them together in adjacent PCI slots. I have two GTX 1070s from EVGA that have twin mounted fans that vent back into the box (the “SC” model). If I had it to do over again, I would get either the “blower” model, which vents out the back, or the water-cooled version.

For a moment I had 3 cards (two 1070s and one 980 Ti), and I found that the waste heat of one card pretty much fed into the intake of the cooling fans of the adjacent cards, leading to thermal overload problems. No reasonable amount of case-fan cooling made a difference.

In my current setup with just the two 1070s, they’re spaced with one empty PCI slot between them, so it doesn’t make much difference, but I suspect that with four cards the “SC” models would have been extremely problematic.

Thanks again for both this post as well as your earlier 2015 post.


Tim Dettmers says


2017-03-31 at 15:34 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13811)

From my experience, the ventilation within a case has very little effect on performance. I had a case specially designed for airflow, and I once tested deactivating the four in-case fans which are supposed to pump out the warm air. The difference was equivalent to turning up the fan speeds of the GPU by 5%. So it may make a difference if your cards are over 80 °C and your fan speeds are at 100%, but otherwise it will not improve performance. In other words, the exhaust design of a fan is not that important; the important bit is how well it removes heat from the heatsink on the GPU (rather than removing hot air from the case). If you compare fan designs, try to find benchmarks which actually test this metric.

If there are cooling issues, though, then water cooling definitely makes a difference. However, for it to make a difference you need to have cooling problems in the first place, and it involves a lot more effort and, to some degree, maintenance. With four cards, cooling problems are more likely to occur.


Paris says
2017-03-30 at 17:09 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13788)

Hi Tim,

Thank you for the great article! I am in the process of building a deep learning / data science / Kaggle box in the 2-2.5k range.

I was going for the GTX 1080 Ti, but your argument that two GPUs are better than one for learning purposes caught my eye.

I am planning on using the system mostly for NLP tasks (RNNs, LSTMs, etc.) and I liked the idea of having two experiments with different hyperparameters running at the same time. So the idea would be to use the two GPUs for separate model trainings and not for distributing the load. At least that’s the initial plan.

Also, since we are talking about text corpora, I guess 6GB of VRAM would work.


On the other hand, I’ve read that RNNs don’t work well with multiple GPUs, so I might experience problems using both of them at the same time.

Taking all that into account, would you suggest two GTX 1070s, two GTX 1080s, or a single 1080 Ti? I am putting the 1080 Ti into the equation since there might be more to gain by having one.


Tim Dettmers says


2017-03-31 at 15:46 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13813)

Hi Paris,

I think two GTX 1070s, or maybe even a single GTX 1070 for a start, might be a good match for you. There might be some competitions on Kaggle that require larger memory, but this should only be important to you if you are crazy about getting top 10 in a competition (rather than gaining experience and improving your skills). Make the choice that is right for you here. Since competitions usually take a while, it might also be suitable to get a GTX 1070 and, if memory holds you back in a competition, to get a GTX 1080 Ti before the competition ends (another option would be to rent a cloud-based GPU for a few days). In terms of data science, you will be pretty good with a GTX 1070. Most data science problems are difficult to deal with using deep learning, so that often the models and the data are the problem and not necessarily the memory size. For general deep learning practice a GTX 1070 works well; especially for NLP, you should have no memory problems in about 90% of cases, and in the remaining cases you can just use a “smarter” model.

Hope this helps.


Paris says
2017-04-03 at 09:20 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-13900)

Thank you very much Tim for taking the time to reply back!


Alex Ekkis says


2017-03-31 at 22:23 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13818)

Thanks for keeping this article updated over such a long time! I hope you will
continue to do so! It was really helpful for me in deciding for a GPU!



Tim Dettmers says


2017-04-03 at 12:44 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13908)

I will definitely keep it up to date for the foreseeable future. Glad that you found it useful!


foojpg says
2017-04-03 at 14:34 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13914)

Hey Tim,

I’m about to buy my parts, but I’m facing one big issue. I’m getting a GTX 1060, but I can’t find the budget to fit in a motherboard with 2 PCIe x16 slots. This means I can’t SLI the 1060 in the future. Should I keep saving up, or is it better to just sell my old 1060 and buy a 1080 Ti / higher-end GPU when the time comes?

Thanks



Tim Dettmers says


2017-04-03 at 15:18 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13920)

That is a difficult problem. It is difficult to say what your needs will be in the future, but if you do not use parallelization, two GPUs behave very much like a single GTX 1080 Ti, so in that case buying the GTX 1060 with one PCIe slot will be good. If you really want to parallelize, maybe even with two GTX 1080 Tis, it might be better to wait and save up for a motherboard with 2 PCIe slots. Alternatively, you could try to get a cheaper, used 2-PCIe-slot motherboard from eBay.


Amir H. Jadidinejad says


2017-04-05 at 12:04 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13962)

I bought a GTX 1080 Ti for deep learning research. I want to install my new card in my old desktop, which has an ASUS P6T motherboard (https://www.asus.com/us/Motherboards/P6T/specifications/). According to the specifications, this motherboard contains 3 x PCIe 2.0 x16 slots (at x16/x16/x4 mode), but the GTX 1080 Ti is a PCIe 3.0 card.

I just want to know if it’s possible to install the 1080 Ti on my motherboard (ASUS P6T)? If yes (it seems PCIe v3 is compatible with PCIe v2), will I face some bottleneck from running on PCIe v2 instead of PCIe v3?

Do you suggest upgrading the motherboard or using the old one?


Tim Dettmers says


2017-04-05 at 18:05 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13966)

If you use a single GTX 1080 Ti, the penalty in performance will be small (probably 0-5%). If you want to use two GPUs with parallelism, you might face larger performance penalties of 15-25%. So if you just use one GPU you should be quite fine; no new motherboard needed. If you use two GPUs, then it might make sense to consider a motherboard upgrade.


Amir H. Jadidinejad says


2017-04-05 at 18:54 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-13970)

Thank you for your valuable comments. I do appreciate your help.


Amir H. Jadidinejad says


2017-04-21 at 23:08 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14397)

Dear Tim,
Would you please consider the following link?
http://stackoverflow.com/questions/43479372/why-tensorflow-utilize-less-than-20-of-geforce-1080-ti-11gb

Is it possible that using PCIe v2 leads to this issue (low GPU utilization)?


Tim Dettmers says


2017-04-23 at 13:11 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-14444)

It is likely that your model is too small to utilize the GPU fully. What
are the numbers if you try a bigger model?
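A quick way to see this for yourself is to time a small versus a large matrix multiplication; the small workload cannot saturate the GPU. A rough micro-benchmark sketch in TensorFlow 1.x (sizes and repeat counts are arbitrary):

    import time
    import tensorflow as tf

    def time_matmul(n, repeats=10):
        # Time an n x n matrix multiplication on the default GPU.
        with tf.Session() as sess:
            a = tf.random_normal([n, n])
            op = tf.matmul(a, a)
            sess.run(op)  # warm-up run (graph setup, kernel launch)
            start = time.time()
            for _ in range(repeats):
                sess.run(op)
            return (time.time() - start) / repeats

    print('256 x 256:   %.5f s' % time_matmul(256))    # GPU mostly idle
    print('4096 x 4096: %.5f s' % time_matmul(4096))   # GPU saturated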


Amir H. Jadidinejad says


2017-04-23 at 19:04
(http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-14455)

You’re right. Running a bigger model leads to better utilization.


Thank you.

Eric Perbos says


2017-04-05 at 19:19 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-13972)

Just beware, if you are on Ubuntu, that several owners of the GTX 1080 Ti are struggling here and there to get it detected by Ubuntu, some failing totally.

Those familiar with the history of Nvidia and Ubuntu drivers will not be
surprised but nevertheless, be prepared for some headaches.

In my case, I had to keep an old 750 Ti as GPU #1 in my rig to get Ubuntu 16.04
to start (GTX 1080 Ti as GPU #0 would not start).


Mario says
2017-06-13 at 15:51 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-16014)


You mean “an old 750 Ti as GPU #1” -> “an old 750 Ti as GPU #0”?


Bhanu says
2017-04-05 at 21:45 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13974)

Hi Tim Dettmers,
I am working with 21GB of input data consisting of video frames. I need to apply deep learning to perform a classification task. I will be using CNNs, LSTMs, and transfer learning. Among the Tesla K80, K40, and GeForce 780, which one do you recommend? Are there any other GPUs which you recommend? Going through your well-written article, I could also consider a Titan X or GTX 1080 Ti. Do I need to use multiple GPUs or a single GPU?


Tim Dettmers says

2017-04-08 at 16:09 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-14036)

Hi Bhanu,


The Tesla K80 should give you the most power for this task and these models.
The GTX 780 might limit you in terms of memory, so probably the K40 and K80 are
better for this job. The GTX 780 might be good for prototyping models. In
terms of performance, there is no huge difference between these cards.

For your task, if you work in research, I would recommend a GTX Titan X or a
GTX Titan Xp depending on how much money you have. If you work in industry,
I would recommend a GTX 1080 Ti, as it is more cost efficient, and the 1GB
difference is not such a huge deal in industry (you can always use a slightly
smaller model and still get really good results; in academia this can break your
neck).


Henry says
2017-04-05 at 23:00 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13978)

I am building a computer right now with a 2,000 budget and I am going with the Asus GTX
1080. Should I go with something a little less powerful, or should I go with this? I
really care about graphics. (Games I want to get are XCOM 2, PlayerUnknown's
Battlegrounds, Civ 6, and the new Mount & Blade.)



Tim Dettmers says


2017-04-08 at 16:06 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14035)

I do not know much about graphics, but it might be a good choice for you over the
GTX 1070 if you want to maximize your graphics now rather than save some
money to use later to upgrade to another GPU. If you want to save some
money, go with a GTX 1070. I guess both could be good choices for you.


foojpg says
2017-04-06 at 03:21 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-13980)

Hey Tim,

I already have a GTX 960 4GB graphics card. I'm faced with two options: buy a
used 960 (same model) so I can have two 960s, or sell my 960 and buy a used
1060 6GB. Which one will be better? I've heard the 1060 will be better for gaming,
but how will it affect DL?

Thanks

P.S. Please note that the price for both paths will be similar (with the 960 path
being more expensive by around 25 dollars).


https://www.behance.net/bestlaptopsunder
(https://www.behance.net/bestlaptopsunder) says
2017-04-07 at 19:29 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14013)

Helpful info. Fortunate me I found your web site by chance, and I’m shocked why
this accident didn’t came about earlier! I bookmarked it.


Elkhan says
2017-04-08 at 00:20 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14018)

Hey Tim,

Thanks for the great post. Wondering if you will include the 2017 Titan Xp in your
comparisons soon too.

I'm planning to build my own external GPU box, mainly for Kaggle NLP competitions.

Yesterday Nvidia introduced the new Titan Xp 2017 model.

I'm planning to buy an Nvidia GPU and use it as an external GPU for NLP deep learning
tasks.


1) Should I go with one Titan Xp 2017 model, or will 2x GeForce GTX 1080 Ti still be better?

2) What about using the GPU externally with my existing Mac or Windows laptop, connected via Thunderbolt?

3) What are your thoughts on the TPU which Google introduced recently?

External GPU boxes:
Mac: AKiTiO 2
Windows: Razer Core, Alienware Graphics Amplifier, or MSI Shadow
Bizon box: https://bizon-tech.com/ (https://bizon-tech.com/)
Blog post: http://www.techrepublic.com/article/how-to-build-an-external-gpu-for-4k-video-editing-vr-and-gaming/ (http://www.techrepublic.com/article/how-to-build-an-external-gpu-for-4k-video-editing-vr-and-gaming/)

Thanks.


Tim Dettmers says


2017-04-08 at 16:03 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14034)

I might update my blog post this evening.

1) In your case I would not recommend the Titan Xp; two GTX 1080 Ti are
definitely better.

2) That should work just fine, but in some cases you might see a performance
drop of 15-25%. In most cases you should only see a performance drop of
5-10% though.

3) The TPU is only for inference, that is, you cannot use it for training. It is
actually quite similar to the NVIDIA GPUs which exist for the same purpose.
Both the NVIDIA GPU and the Google TPU are generally not really interesting
for researchers and normal users, but are for (large) startups and companies.


kartheek says
2017-04-08 at 16:00 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14033)

Hey Tim, it's a great article! I am new to ML. Currently I have a Mac mini. I found a few
alternatives to add an external graphics card through the Thunderbolt port. Can I run ML
and deep learning algorithms on this?


Tim Dettmers says

2017-04-09 at 10:50 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-14046)

I have never seen reviews on this, but theoretically it should just work fine. You
will see a performance penalty though, which depending on the use case is
anywhere between 5-25%. However, in terms of cost this might be a very
efficient solution, since you do not have to buy a full new computer if you use
external GPUs via Thunderbolt.


Adam says
2017-04-10 at 09:01 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14067)

You’ve talked a bit about it in various comments, but it would be great if we could get
your thoughts on the real-world penalties of running PCIe 3.x cards in PCIe 2.x
systems. I’m guessing that for single-GPU setups the reduced bandwidth would have
minimal impact, but what about multi-GPU configurations?


Adam says
2017-04-11 at 06:15 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14090)

Another thing I’d be curious to hear your thoughts on is the performance
penalty of locating GPUs in x8 PCIe slots.


Tim Dettmers says


2017-04-13 at 14:41 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14149)

I do not think you can put GPUs in x8 slots, since they need the whole x16
connector to operate. If you mean putting them in x16 slots
but running them with 8x PCIe lanes, this will be okay for a single GPU, and
for 3 or 4 GPUs this is the default speed. Only with 2 GPUs could you have
16x lanes, but the penalty of parallelism on 8x-lane GPUs is not too bad if
you only have two GPUs. So in general, 8x lanes per GPU are fine.


Tim Dettmers says


2017-04-13 at 13:56 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14146)

The impact will be quite great if you have multiple GPUs. It is difficult to say
how big it will be, because it varies greatly between models (CNN, RNN, a mix
of both, the data formats, the input size) and can also differ a lot between
architectures (ResNet vs VGG vs AlexNet). What I can say is that if you use
multiple GPUs with parallelism, then an upgrade from PCIe 2.0 to PCIe 3.0 and
an upgrade in PCIe lanes (32 lanes for 2 GPUs, or 24 lanes for 3, 36 lanes for 4
GPUs) will be the most cost efficient way to increase the performance of your
system. Slower cards with these features will often outperform more expensive
cards on PCIe 2.0 systems or systems with not enough lanes for 16x (2 GPUs)
or 8x speed (3-4 GPUs).
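
As a back-of-the-envelope illustration of why the link speed matters for parallelism, assume roughly 1 GB/s per PCIe 3.0 lane (half that for PCIe 2.0) and a hypothetical 100M-parameter model whose 32-bit gradients are exchanged each iteration:

    # Rough gradient-synchronization cost per iteration for data parallelism.
    # All numbers are assumptions for illustration, not measurements.
    params = 100e6            # hypothetical model size (parameters)
    grad_bytes = params * 4   # 32-bit gradients

    GB = 1024**3
    for name, bandwidth in [("PCIe 2.0 x8", 4 * GB),
                            ("PCIe 3.0 x8", 8 * GB),
                            ("PCIe 3.0 x16", 16 * GB)]:
        ms = grad_bytes / bandwidth * 1000
        print("%-13s about %.0f ms per gradient exchange" % (name, ms))

With iteration times often in the low hundreds of milliseconds, the difference between roughly 90 ms and roughly 25 ms of communication per exchange is exactly the kind of gap that lets a slower card on a faster bus win.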


Ghulam Ahmed says


2017-04-10 at 14:13 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14073)

Hi, I have a GTX 650 Ti 1GB GDDR5. How is it for starters?


Tim Dettmers says


2017-04-13 at 14:00 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14148)

It will be slow and many networks cannot be run on this GPU because its
memory is too small. However, you will be able to run cuDNN, which is a big
plus, and you should be able to run examples on MNIST and other small
datasets without a problem and faster than on the CPU. So you can definitely
use it to get your feet wet in deep learning!


hashi says
2017-04-12 at 09:34 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14121)

Hi,

Thank you very much for providing useful information! I’m using AWS P2 (the
cheapest one) but planning to switch to another GPU environment, for example a DELL
desktop or laptop. What is the difference between a laptop GPU and a desktop GPU for
training deep learning networks? For example, a GTX 1060 6GB on laptop and on
desktop. GPU memory bandwidth?

Cheers,
Hashi


Tim Dettmers says


2017-04-13 at 14:45 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14150)

For newer GPUs, that is the 10 series, there is no longer any real difference
between laptop and desktop GPUs (the GTX 1060 is very similar to
the GTX 1060 laptop version). For earlier generations the laptop version often has
smaller bandwidth; sometimes the memory is smaller as well. Usually
laptop GPUs consume less energy than desktop GPUs.


Martin Thoma (https://martin-thoma.com/) says


2017-04-13 at 12:44 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14144)

Thanks for the post. (There is a small typo: “my a small margin”)


Tim Dettmers says


2017-04-13 at 14:48 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14151)

Fixed — thanks for pointing out the typo!


Bruce says

2017-04-17 at 19:50 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-14291)

Thanks for the article. I have a MacBook Pro and, considering the recent release of Mac
drivers for the Pascal architecture, I am considering getting an external GPU rig that
would run over Thunderbolt 3. Any concerns with this? Do you know how much
penalty I would pay for having the GPU be external to the machine? It appears on
the surface that PCIe and Thunderbolt 3 are pretty similar in bandwidth.

Previously for deep learning research I have been using Amazon instances.


Tim Dettmers says


2017-04-18 at 10:41 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14305)

Thunderbolt 3 is 5GB/s compared to PCIe, which is 16GB/s. If you use three or
more GPUs, the bandwidth of PCIe will shrink to 8GB/s. If you use just one
GPU, the penalty should be in the range of 0-15% depending on the task.
Multiple GPUs should also be fine if you use them separately. However, do not
try to parallelize across multiple GPUs via Thunderbolt, as this will hamper
performance significantly.
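
To put those bandwidth figures in perspective for the single-GPU case, a small sketch (batch size and image dimensions are hypothetical) comparing host-to-GPU input transfer times:

    # Compare host-to-GPU transfer time over Thunderbolt 3 vs. PCIe 3.0 x16,
    # using the ~5 GB/s and ~16 GB/s figures above. Purely illustrative.
    batch_bytes = 128 * 224 * 224 * 3 * 4   # hypothetical float32 image batch

    for name, gb_per_s in [("Thunderbolt 3", 5.0), ("PCIe 3.0 x16", 16.0)]:
        ms = batch_bytes / (gb_per_s * 1024**3) * 1000
        print("%-14s %.1f ms per batch" % (name, ms))

If the forward/backward pass for such a batch takes tens of milliseconds or more, the extra ~10 ms of transfer time mostly hides behind compute, which is consistent with the 0-15% penalty mentioned above.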



Spencer says
2017-04-18 at 07:31 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14299)

Fantastic article!
I’m interested in starting a little Beowulf cluster with some VIA mini-ITX boards and I
was wondering how I could add GPU compute to that on a basic level. They only have
PCIe x4, but I could use a riser. I was thinking of the Zotac GT 710 PCIe x1 card, one on
each board.


Tim Dettmers says


2017-04-18 at 10:44 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14306)

Unfortunately, the GT 710 would be quite slow, probably on a par with your
CPU. I am not sure how well GPUs are supported if you just connect them
via a riser. If it works for PCIe x16 cards, then this would be an option. If
you just want a cheap GPU and the x16 thing works, then you can go with a
GTX 1050 Ti.


Spencer says
2017-04-19 at 07:41 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14328)


Here is the board I am looking at. I’m planning on having around 40 of these
rackmounted, all in a cluster, and each with a GPU in it. I’m not looking for
hyper power, just something fun to mess with. The riser idea sounds good!
I’m just looking for budget stuff here, and I figured many low-power
devices are as good as one high-power device.


Spencer says
2017-04-19 at 07:42 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14329)

http://www.viatech.com/en/boards/mini-itx/epia-m920/
(http://www.viatech.com/en/boards/mini-itx/epia-m920/)

Forgot the link


Tim Dettmers says


2017-04-21 at 12:31 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-14385)

I have no idea if that will work or not. The best thing would be to try it
for one card and once you get it running roll it out to all other
motherboards.



Spencer says
2017-04-24 at 09:42
(http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14468)

Sounds good! Still in the planning phase, so I may revise it quite a
bit. I really appreciate your help!

Marc-Philippe Huget says


2017-04-19 at 18:57 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14341)

Hello Tim,

Except if I missed the information in the post, you should mention that the newest Nvidia
cards limit the number of cards in a PC: the GTX 1080 Ti seems to be limited to two
cards. This could be a main issue if we want several experiments in parallel, or
multi-GPU with CNTK (and PyTorch). Knowing that, I am not sure about building a rig with
the GTX 1080 Ti if I want to level up the system in the future.

Cheers,
mph


Tim Dettmers says


2017-04-21 at 12:18 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14380)

Hello Marc-Philippe,

I think you are confusing NVIDIA SLI HB limitations with PCIe slot limitations.
You will only be able to run 2 GTX 1080 Ti if you use SLI HB, but if you use
compute you are able to use up to 4 GPUs per CPU. SLI and SLI HB are not
used for compute, but only for gaming. Thus there should be no limitation on
the number of GTX 1080 Ti you can run, besides the CPU and PCIe slot
limitations.


chanhyuk jung says


2017-04-20 at 18:00 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14360)


I live in Korea and the electricity bills are very expensive here, so I prioritized power
efficiency over performance. But I couldn’t decide from reviews from gamers
showing performance in frames per second. So what is the most power efficient
GPU? (It doesn’t matter if it’s from AMD.)


Tim Dettmers says


2017-04-21 at 12:22 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14381)

Bigger GPUs are usually a bit more power efficient if you can fully utilize them.
Of course this depends on the kind of task you are working on. If you look at
performance per watt, then all cards of the 10 series are about the same, so it
really depends how large your workloads are; optimize for that. That is, get a
10 series card of a size which fits your models. You should prefer 10 series cards
over 900 series cards, since they are a bit more energy efficient for the
performance they offer.


Matt Sandy (http://appup.io/) says


2017-04-21 at 01:18 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14366)


It would be interesting to see what happens when using an eGPU housing over
Thunderbolt.


Tim Dettmers says


2017-04-21 at 12:23 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14382)

Indeed, many people have asked about this and I would also be curious about
the performance. However, I do not have the hardware to make such tests. If
someone has some performance results, it would be great if they could
post them here.


Anh says
2017-04-21 at 06:24 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14373)

Hi Tim,
May I ask you a question? I am on a low budget and I am weighing the
GTX 970M against the GTX 1050 Ti (mobile version). Could you give me advice on which one
I should get?


Tim Dettmers says


2017-04-21 at 12:25 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14383)

The cards are very similar. I think the GTX 1050 Ti (notebook) might be slightly
better in performance, but in the end I would make the decision based on the
cost, since the GPUs are quite similar.


Mahesh Govind (http://www.digiledge.com) says


2017-04-21 at 17:38 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14388)

Thank you for the good article.

I have a machine with 2 Titan X Pascals and 64GB RAM.

Do you recommend running two separate models simultaneously, or parallelizing a
model across two GPUs?


regards
Mahesh


Tim Dettmers says


2017-04-23 at 13:12 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14445)

With 2 GPUs parallelism is still good without any major drawback in scaling. So
you could run them in parallel or not, that depends on your application and
personal preference. If your models run for many days, I would go for
parallelism and otherwise just use them simultaneously.
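
For the "use them simultaneously" route, most frameworks honor the CUDA_VISIBLE_DEVICES environment variable, so each experiment only sees its own GPU. A minimal sketch (the two training scripts are hypothetical):

    import os
    import subprocess

    # Pin one experiment to each Titan X; each process sees only "its" GPU.
    # model_a.py / model_b.py are hypothetical training scripts.
    procs = []
    for script, gpu in [("model_a.py", "0"), ("model_b.py", "1")]:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
        procs.append(subprocess.Popen(["python", script], env=env))
    for p in procs:
        p.wait()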


Phil says
2017-04-22 at 22:57 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14428)

Hi,

First, thanks for this great article.

Do you think I could install a GTX 1060 (6GB) on the following configuration:
Processor: Intel Pentium G4400
Integrated GPU: Intel HD Graphics 510
RAM: 4GB DIMM DDR4
Motherboard: Asus H110M-K

Nvidia is telling me that the GTX 1060 requires at least a Core i3 to run, but I’m seeing
on CPU benchmarks that the G4400 is not that bad compared to some versions of the
Core i3, so I’m lost…

Thanks a lot

Phil


Tim Dettmers says


2017-04-23 at 13:15 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14447)

The CPU should be fine. I think NVIDIA is referring to gaming performance
rather than CUDA performance. For gaming performance this might be true,
but for deep learning it should have almost no effect. So it should be fine.


Nikos Tsarmpopoulos says


2017-04-23 at 05:25 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-14437)

Hi Tim,

This is a very good and interesting article, indeed!

What I’m still confused about is FP16 vs FP32 performance of the GTX 1000 series
cards. In particular, I’ve read that deep learning uses FP16 and the GTX 1000 series is
too slow on FP16 (1:64), which means NVIDIA forces users of deep learning tools
to buy a significantly more expensive Tesla or Quadro card.

I’m very new to deep learning and I would expect that an algorithm that requires
FP16 accuracy could also be used with FP32 accuracy; is this not the case? If a card
doesn’t support the performance optimisations required for doubling performance
with FP16, I expect we would be limited by its FP32 performance. However, in this
case, I don’t get why NVIDIA decided to cap the performance of FP16 on these
cards, i.e. why not let them perform in FP16 similarly to FP32.

Thanks


Tim Dettmers says


2017-04-23 at 13:21 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14449)

There are two different ways to use FP16 data:

(1) Store data in FP16 and during computation cast it to FP24 and use FP32
computation units, or in other words: store in FP16 but use FP32 for compute.
(2) Store data in FP16 and use FP16 units for compute.

Almost all deep learning frameworks use (1) at the moment, because only one
GPU, the P100, has FP16 units. All other cards will have very, very poor FP16
performance. I do not know why NVIDIA decided to cap performance on most
10 series cards. It might be a marketing ploy to get people to buy the P100, or it is a
hardware constraint and it is just difficult to put both FP16 and FP32 compute
units on a chip and make it cheap at the same time (thus only FP32 on
consumer cards to make them cheap to produce).

But in general, it is as you said: if an algorithm runs in FP16 you can expect to be
able to run it in FP32 just fine — so there should be no issues with compatibility
or such.
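
A minimal NumPy sketch of option (1), storing in FP16 but computing in FP32 (illustrative only; real frameworks perform these casts on the GPU):

    import numpy as np

    # Option (1): weights and activations are *stored* as float16 (half the
    # memory and bandwidth), but every computation is done in float32.
    W = np.random.randn(1024, 1024).astype(np.float16)   # stored in FP16
    x = np.random.randn(1024, 128).astype(np.float16)

    y = W.astype(np.float32) @ x.astype(np.float32)      # compute in FP32
    y16 = y.astype(np.float16)                           # store result in FP16
    print(W.nbytes, y16.nbytes)   # half the bytes of the FP32 equivalents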


Nikos Tsarmpopoulos says


2017-04-23 at 21:58 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14458)

Thanks for your response. My understanding was that NVIDIA had
implemented FP16 in a smart way that reused FP32 to effectively double
performance. I presumed this feature is implemented in firmware and sold
at a premium, within the Tesla and Quadro product lines, similarly to the
firmware-based (as opposed to hardware-based) implementation of ECC
memory.

If FP16 is natively implemented in separate (hardware) circuitry within the
GPU, it would indeed make economic sense for NVIDIA to exclude that
from the consumer-grade product. Even if this is the case, though, since it’s
possible to cast FP16 to FP32 for compute, I can’t imagine why NVIDIA
has not implemented FP16 by casting it to FP32 in the firmware. It’s not a
question, just an observation that has left me puzzled.


Felix (https://github.com/aktivkohle) says


2017-04-25 at 19:03 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14514)

Hi Tim, thanks for that! Very comprehensive article. I now have the GeForce GTX
1050 Ti installed and running. Chose it as the PC is from 2010 (Dell Precision
T1650) and without changing the power supply it can only supply 75W from the board.
Also, I read somewhere that other cards won’t fit, and noticed as I put it in that it was
physically millimetres away from the spot where the hard drives plug in. So I imagine
even if you changed the power supply in this unit for something that could run the
other cards, they are probably bigger and would hit into things and so not even push
into the slot. Not sure if the other cards even use the same kind of slot that was
around in 2010.
I can see it is about 1/5 as fast as the fastest cards but, as per your cost analysis, the
best bang for your buck. Took a whole day of painful trial and error to get CUDA 8.0
and cuDNN 5 properly installed, but it works now finally on Linux Mint; the last
niggling issues were extra lines needed in .bashrc.

Just tested some style transfer and it took 9 minutes compared with hours that it
took on the CPU; each iteration took about 0.25 seconds rather than 10 seconds
or so.


Tim Dettmers says


2017-04-29 at 13:52 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14630)

Hi Felix, thanks for your story! It seems that everything worked out well for you
— it always makes me happy to hear such stories! Indeed, the size of GPUs and
the power requirements can be a problem, and I think the GTX 1050 Ti was the
right choice here.


Fabien says
2017-04-26 at 09:54 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14531)

Hi Tim,
Very interesting article, thanks very much for your work!
I expect to participate in a few Kaggle competitions for fun and challenge, as well as
experimenting for myself.

I need to change my GTX 750 Ti (too slow) and I am hesitating between the GTX 1060
6GB and the GTX 1070.
The best deals for the 1060 are currently around 250€ but I just found a 1070 at 300€;
would you say it’s worth it?


Tim Dettmers says


2017-04-29 at 13:48 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14629)

Yes, the GTX 1070 seems like a good deal, I would go ahead with that!


Fabien says
2017-04-29 at 14:06 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14633)

Thanks!
I gave it a go as I could return it if needed.
The gap between 750ti and 1070 is so huge…

Let’s have fun now




quest says
2017-04-27 at 09:51 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14557)

I am about to buy three GPUs for deep learning and sometimes entertainment.
These are
one Gigabyte Aorus 1080 Ti card and two EVGA 1080 Ti. All three have the
same chip, i.e. the 1080 Ti, and the only difference is their cooling solution. My questions
are:

a) Will I be able to use all three of them for parallel deep learning?
b) Will I be able to SLI the Aorus and EVGA cards?
c) Is there any other trouble, e.g. related to BIOS etc., from mixing same-chip cards
from multiple vendors?

Thank you very much


Tim Dettmers says


2017-04-29 at 13:48 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14628)

a) Yes, that will work without any problem.
b) I had an SLI of an EVGA and an ASUS GTX Titan for a while and I assume this
still works, so yes!
c) There should not be any big troubles. Both for games and for deep learning
the BIOS does not matter that much; mixing should cause no issues.


Nikos Tsarmpopoulos says


2017-04-29 at 14:02 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14632)

I think on the 1080 Ti you can only get 2-way SLI. A standard SLI bridge will do
fine for up to 4K resolution; a high-bandwidth bridge is needed for 5K and 8K
resolution.

You can use all three of them for parallel deep learning, or any other CUDA-
based or OpenCL-based application.


Moondra says
2017-04-27 at 20:55 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14573)

Thank you.
I was looking to do some Kaggle competitions as well as video editing.
I guess I will get the GTX 1060 6GB.. off to Slick deals!



Tim Dettmers says


2017-04-29 at 13:45 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14627)

Sounds like a solid choice! I am glad the blog post was helpful!


Roberto says
2017-04-28 at 12:15 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14591)

I’m looking for a used card. What’s better between a 960 4GB GDDR5 and a 1050 Ti
4GB?
I’m asking because the 960 has more CUDA cores. Thanks!


Tim Dettmers says


2017-04-29 at 13:44 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14625)


The cards are very similar. The GTX 1050 Ti will be maybe 0-5% faster. I would
go for the cheaper one.


Ahmed Adly says


2017-04-28 at 15:39 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14596)

Hello Tim,
You are providing great information in this blog, with significant value to people into
deep learning.

I have an iMac 5K with a Core i5 and 32GB RAM, and I am thinking of adding one NVIDIA
Titan Xp to it via eGPU as a first step into deep learning. Do you think this is a good
choice, or should I sell it and go directly for a custom GPU rig?

Also, are there any readymade options?


Ahmed Adly says


2017-04-28 at 23:28 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14604)


Also can you comment on this setup?

https://pcpartpicker.com/list/yCzT9W (https://pcpartpicker.com/list/yCzT9W)


Tim Dettmers says


2017-04-29 at 13:41 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14623)

Looks like a pretty high-end setup. Such a setup is in general suitable for
data science work, but for deep learning work I would get a cheaper setup
and just focus on the GPUs — but I think this is kind of personal taste. If
you use your computer heavily, this setup will work well. If you want to
upgrade to more GPUs later though, you might want to buy a bit bigger
PSU. I think around 1000-1200 watts will keep you future proof for
upgrading to 4 GPUs; 4 GPUs on 850 watts can be problematic.


Ahmed Adly says


2017-04-30 at 00:29 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-14647)
Thanks a lot Tim. I have changed the setup by adding a 1200W PSU and a
4-way SLI motherboard with an i7 7700 Kaby Lake processor.

The remaining question is that this motherboard supports only 64GB
RAM. Will this cause problems in the future? Do I need 128GB if I will work
on 250GB+ datasets?

Updated list: https://pcpartpicker.com/list/gHTLBP (https://pcpartpicker.com/list/gHTLBP)

Best,
Ahmed


Mario says
2017-08-21 at 16:02
(http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-19608)

Hi Ahmed,

Thank you for sharing your setup. I was wondering, did you
shortlist any alternative motherboard for the 128GB limitation?
I have come to the same conclusion; I don’t like the 64GB limit.

Tim Dettmers says



2017-04-29 at 13:42 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14624)

There are some readymade options, but I would not recommend them as they
are too pricey for their performance. I think building your own rig is a good option
if you really want to get into deep learning. If you want to do deep learning on
the side, an extension via eGPU might be the best option for you.


Semanticbeeng says
2017-04-29 at 19:31 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14639)

Hi Tim – many thanks for all the knowledge and time!

Application: a multi-GPU setup for deep learning requiring parallelization across
GPUs and at least 32GB of GPU RAM.

Say I choose to use 4 GTX 1080 Ti and am concerned with the loss due to inter-GPU
communication, but also with the heat/cooling and noise.

Based on all your teaching above, I am thinking it would be better to use two smaller
computer cases, with two GPUs each, and connect them with an InfiniBand FDR
card than try to cram all 4 GPUs into a single box.


Also, having 2 smaller boxes gives:

1. more resiliency in terms of points of failure
2. dynamic scalability, in terms of bringing up all 4 or just using 2
3. flexibility, if I want to replace 2 with more powerful math GPUs for some
applications needing higher precision.

Is this on the right path?

Do you have any data on how much memory bandwidth loss there would be in this
setup as opposed to putting all 4 in the same box?
Saw the NA255A-XGPU https://www.youtube.com/watch?v=r5y2VbMaDWA
(https://www.youtube.com/watch?v=r5y2VbMaDWA) but it is very expensive.

Please provide a critical review and advice.

Many thanks
Nick


Tim Dettmers says


2017-05-02 at 13:48 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14699)

I would use as many GPUs per node as you can if you want to network them
with InfiniBand. Parallelism will be faster the more GPUs you run per box, and if
you only want to parallelize 4 GPUs, then 4 GPUs in a single box will be much
faster and cheaper than 2 nodes networked with InfiniBand. Also, programming
those 4 GPUs will be much, much easier if you use just one computer. However,
if you want to scale out to more nodes it might differ. This also depends on your
cooling solution. If you have fewer than 32 GPUs I would recommend 4 GPUs per
node + InfiniBand cards. For usual over-the-counter motherboards you can run
3 GPUs + an InfiniBand card. The details of an ideal cost effective solution for
small clusters depend on the size (it can differ dramatically between 8, 16, 32,
64, 96, 128 GPUs) and the cooling solution, but in general trying to crank as
many GPUs into a node as possible is the most cost effective way to go and also
ideal in terms of performance. In terms of maintenance costs this would not be
ideal for larger clusters with 32+ GPUs, but it will be okay for smaller clusters.


Semanticbeeng says
2017-05-02 at 14:21 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14700)

Many thanks!
“depends on your cooling solution” – please suggest a solution that would
not kick me out of the house from heat, noise and electromagnetic
radiation.

“If you have less than 32 GPUs I would recommend 4 GPUs per node +
InfiniBand cards.” – what do you mean? I thought InfiniBand is for in between nodes.
Did you mean “32 GB” instead of “32 GPUs”?

Many thanks in advance, Tim!

“programming those 4 GPUs will be much, much easier if you use just one
computer” – kindly provide some hints about where this would be visible in
the code/libraries we’d use.


Eric Perbos says


2017-05-02 at 14:27 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14701)

You’d also want at least 128GB RAM with such a setup, so make sure the mobo
can take it (though any mobo with more than 4 PCIe slots most likely does anyway).

For example, in Lesson 2 of the Fast.ai MOOC, they use some code to concatenate
batches into an array that takes 55GB of RAM.
No problem on the original AWS course setup, as basic EC2 can scale to 2TB.
But for newcomers with a personal PC with a GTX 1080 Ti and 32GB RAM, it
generates a MemoryError: that requires investigating RAM issues, rewriting code
etc., instead of focusing on deep learning.
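
One common workaround, instead of buying that much RAM, is to keep the big array on disk and memory-map it; a minimal NumPy sketch (the shape and filename are hypothetical):

    import numpy as np

    # Write features to a disk-backed array instead of concatenating in RAM;
    # the OS pages chunks in and out as they are touched.
    shape = (300000, 4096)                 # hypothetical dataset of features
    arr = np.lib.format.open_memmap("features.npy", mode="w+",
                                    dtype=np.float32, shape=shape)
    for i in range(0, shape[0], 1024):     # fill it batch by batch
        arr[i:i + 1024] = np.random.randn(min(1024, shape[0] - i),
                                          shape[1]).astype(np.float32)
    arr.flush()

Training code can then slice mini-batches out of the memmap without ever holding the whole 55GB array in memory.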

Also consider the “Aero” cooling (like the Founders Edition 1080 Ti), as it expels heat
via the GFX backpanel outside the case, instead of 4 monsters gladly blowing at
each other inside 4 walls.


Semanticbeeng says
2017-05-02 at 15:02 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14702)


“Founder Edition 1080 Ti” with “aero cooling” – check!

“Fast.ai MOOC” – nice reference, check!

“at least 128gb Ram with such a setup so make sure the Mobo can” –

So I need 128GB RAM on the motherboard to handle the 4 GPUs?

Or do you mean in total with the GPU RAM – the GPUs being in the same
box forming a continuous address space with the CPU/motherboard RAM?

If you meant the second, then I need 128 – 4 x 11GB GPU RAM = 84GB on
the CPU.

Kindly clarify.


Mario says
2017-08-21 at 16:10 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-19609)

Semanticbeeng,

Another aspect to consider is that parallelising on multiple machines is
not as easy as parallelising on the same machine.

I don’t think you want to write the code that does the distribution
yourself – you want it to be transparently handled by the library you
are using (Tensorflow/Torch etc).

Now, if we are talking about, say, hyperparameter tuning, this is rather
easy to distribute: each execution is independent (unless perhaps you’re
running something that needs to adjust the params dynamically), and you
can ship it out to a separate machine easily.

But within the same model, things are not that easy anymore. Some
models lend themselves to being distributed.

I believe that current libraries are much better at distributing across
multiple GPUs on the same machine, relatively effort-free (a config
change), as opposed to across a network cluster.

I share your concern on the single point of failure (machine catching
fire). So perhaps the solution is to have 5 x (4-GPU machine)? Just
partially joking. Good luck with the heating.


Nacho says
2017-05-04 at 11:48 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14738)


Hi!
I am currently starting my thesis on adversarial learning. The department in which I
will be working provides me (limited) remote access to a workstation in order to
train on large volumes. However, I was thinking of getting some cheap machine (my
budget is very limited at the moment) in order to try small simulations, since I cannot
run anything on my laptop. Furthermore, I am not sure how much I will use my
computer in the future (it depends if I go for a PhD or not after all), so I just need a
basic machine that can be upgraded if needed.
Here is a configuration I found in an online shop in Germany:

Fractal Design Core 1100
Processor: Intel Core i3-7100, 2x 3.90GHz
Cooler: Intel standard cooler
Mainboard: Gigabyte H110M-DS2, socket 1151
Graphics card: MSI GeForce GTX 1050 Ti Gaming 4G
Memory: 8GB Crucial DDR4-2133
Hard drive: 1TB Western Digital WD Blue SATA III
Optical drive: DVD+/- burner, double layer
PSU: 300W be quiet! System Power 8 80+
Sound card: HD audio onboard

Price: 635€

Do you think it’s a good deal? Would it be enough given the needs I described before?
Would you change anything?



Tim Dettmers says


2017-05-04 at 18:01 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14742)

The setup will be okay for your thesis. Maybe a GTX 1060 would be more
suitable due to the larger memory, but you should be okay to work around the
4GB GPU memory and still write an interesting thesis.

If you would like to pursue a PhD and need more GPUs, the board will limit you here.
The whole setup will be more than enough for any new GPU which will be out
in the next few years, but your motherboard just holds one GPU. If that is okay
for you (you can always access cloud-based GPUs or the ones provided for
extra computational power), then this should be a good, suitable choice for you.


Nacho says
2017-05-04 at 18:30 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14746)

Thank you so much for your comment. I found a similar machine a bit
cheaper:
https://www.amazon.de/DEViLO-1243-DDR4-2133-24xDVD-RW-Gigabit-LAN/dp/B003KO3HQM/ref=sr_1_10?s=computers&ie=UTF8&qid=1493910853&sr=1-10&keywords=DEViLO&th=1

Plus it has Windows 10 already installed (saves some time). I compared
them many times and I cannot see any major difference.

Thank you once more.

Cheers,
Nacho


Nacho says
2017-05-06 at 13:04 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14787)

Hi Tim,

One last question. Would an AMD FX-6300 CPU, 6x 3.50GHz, do a similar
job to an Intel i3-7100? I have seen that the price difference is
pretty big and it would maybe allow me to get a better GPU (1060 6GB). Is it
worth it?

Thank you once more

Nacho


Nacho says
2017-05-06 at 13:31 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-14788)

Actually I saw this model:
https://www.amazon.de/8-Kern-DirectX-Gaming-PC-Computer-8x4-30/dp/B01IPDIF4Q/ref=sr_1_1?s=computers&ie=UTF8&qid=1494068667&sr=1-1&keywords=gtx%2B1060%2B6gb&th=1

It includes a 1060 6GB and 16GB RAM for 699 euros. The only thing is
that it has an AMD Octa-Core FX 8370E CPU instead of an Intel.

What do you think?


Semanticbeeng says
2017-05-05 at 08:38 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14758)


Alternatives to AWS servers:

1. https://www.hetzner.de/sk/hosting/produkte_rootserver/ex51ssd-gpu
(https://www.hetzner.de/sk/hosting/produkte_rootserver/ex51ssd-gpu)
dedicated server (not a VM)
1x GeForce® GTX 1080 (only)
64GB DDR4 RAM
€120 / month

2. https://www.ovh.ie/dedicated_servers/ (https://www.ovh.ie/dedicated_servers/)
4x NVIDIA GeForce GTX 1070
64GB DDR4 RAM
€659.99 / month


Tim Dettmers says


2017-05-05 at 14:55 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14770)

I have heard good things about the Hetzner one; it can be a good option. Thanks
for the info.



Kelvin says
2017-05-06 at 06:24 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14783)

Hi Tim – many thanks for all the knowledge and time!

I’d like to buy a deep learning server. Would you please give a performance
comparison between the Tesla P100 and the Titan Xp?


Tim Dettmers says


2017-05-10 at 13:06 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14892)

The Tesla P100 is faster, but costs far too much for its performance. I would not
recommend the P100.


Ahm says

2017-05-07 at 08:55 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/#comment-14807)

Hello,
Thanks for the great post. I’m very new to this topic, and wanted to set up a desktop
computer for deep learning projects (for fun and Kaggle). I had 2 questions, and would really
appreciate it if you could help me with them:
1) I just ordered a GTX 1080 Ti from the NVIDIA website. Then I noticed that there are
other versions like EVGA, or MSI… I’m kind of confused about what they are, or if I
should’ve got those.
2) For the desktop, I just got a Lenovo ThinkCentre M900, with an i7 6700 processor and
32GB RAM. Then I doubted if I really can put my GTX 1080 inside that.. any thoughts?

thanks a lot


Tim Dettmers says


2017-05-10 at 13:14 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-14894)

The case seems to be able to hold low-profile x16 GPU cards. It seems that the
system would support a GTX 1080, but the case does not; it will probably be
too big for the case. I would contact your vendor for more precise information.


Fabien says
2017-05-10 at 13:30 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14895)

Hi, I would add that the PSU (the website says 400W max power as an option)
seems under-scaled for the 1080 Ti.
Also, I experienced an issue with a Dell system where the PSU could not be
changed for a more powerful one because of a proprietary motherboard
connector. I would not be surprised if you found the same on this kind of
system.

As Tim suggested, contact your vendor for compatibility info.


Ahm says
2017-05-12 at 07:07 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14936)

Thanks a lot. I canceled it, and found the PCPartPicker website, which checks
the compatibility between the parts.
Thanks again!


Kenneth says
2017-05-11 at 20:55 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14924)


Looking forward to when you update this with the new V100!
Particularly curious if the increased bandwidth of NVLink 2 changes your opinion of
multi-GPU setups.


Vineeth (http://deepview.ai) says


2017-05-11 at 22:38 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14926)

Hi Tim,
Great blog, and very informative for deep learning enthusiasts. I am building a
rig myself for deep learning. Here are the components I am planning to
get: https://pcpartpicker.com/list/WzGZd6 (https://pcpartpicker.com/list/WzGZd6)
One question I have is regarding the processor. It seems like the Xeon 1620 V4 is pretty
good, in terms of its 40 lanes. It is outperformed by the 1650 V4, but that is also twice the
price ($600). To add 4 GPUs in total, I’ll also need a mobo with at least 4 PCIe slots, so
something like an MSI X99A Gaming Pro looks reasonable, although I am not sure they’ll
physically fit in. I might just do 3 GPUs then. Any comments?


Tim Dettmers says


2017-05-15 at 17:45 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15024)


Looks reasonable; the motherboard supports 4 GPUs, but it seems only 3
GPUs fit in there. So if you want to go with 4 GPUs you should get a different
motherboard.


Sourish says
2017-05-14 at 07:45 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14991)

Hi. Lovely explanation! I am a newbie and I have a GPU with 4GB of dedicated video
memory, 4GB of shared memory, and around 112GB/s bandwidth (GeForce GTX
960). I need to know the number of convolution layers that I can implement using
such hardware. Besides, what would be the maximum size of the input image that I
can feed to the network?


Tim Dettmers says


2017-05-15 at 17:54 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15028)

This depends on input size (how large the image is) and can be dramatically
altered by pooling, strides, kernel size, and dilation. In short: there is no concrete
answer to your question, and even if you specify all the parameters, the only
way to check this would be to implement the same model and test it directly. So
I would suggest you just try it and see how many layers you can get.
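
For instance, a minimal sketch of this trial-and-error approach (PyTorch assumed; the
layer width, image size, and batch size below are made-up examples, not recommendations):

    import torch
    import torch.nn as nn

    def fits_on_gpu(n_layers, image_size=224, batch_size=16):
        # Build a toy stack of 3x3 conv layers and try one training step on it.
        layers, channels = [], 3
        for _ in range(n_layers):
            layers += [nn.Conv2d(channels, 64, kernel_size=3, padding=1), nn.ReLU()]
            channels = 64
        net = nn.Sequential(*layers).cuda()
        try:
            x = torch.randn(batch_size, 3, image_size, image_size).cuda()
            net(x).sum().backward()        # forward + backward, as in training
            return True
        except RuntimeError:               # a CUDA out-of-memory error lands here
            torch.cuda.empty_cache()
            return False

    for n in range(1, 100):
        if not fits_on_gpu(n):
            print('Out of memory at', n, 'layers')
            break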


Pedro Porto Buarque de Gusmão says


2017-05-18 at 13:04 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15075)

Hi Tim,

any thoughts on the new Radeon Vega Frontier Edition, from a hardware point of
view and hoping that DL libraries will come soon?

Reply (http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/?
replytocom=15075#respond)

Tim Dettmers says


2017-05-21 at 20:07 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15180)

The card is impressive from its raw hardware specs. However, I am not aware
that any major changes in deep learning support will come along with the
card. As such, AMD cards are still not viable at this time. It might change in a
few months down the road, but currently there is not enough evidence to make
a bet on an AMD card now. We just do not know if it could be useful later.


Howard Park says


2017-05-19 at 03:45 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15092)

Hi Tim! Thank you for the great post!

Do you think we will use FP64 for deep learning anytime soon?
And could you give me some examples of using FP64 for deep learning?

Thank you.


Tim Dettmers says


2017-05-21 at 20:10 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15181)

FP64 is usually not needed in deep learning since the activations and gradients
are already precise enough for training. Note that the gradient updates are
never correct with respect to the full dataset if you use stochastic gradient
descent, and thus more precision just does not really help. You can do well with
FP32 or even FP16, and the trend is further downwards rather than upwards. I
do not think we will see FP64 in deep learning anytime soon, because frankly
there is no reason to use it.
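
As an illustration of that downward trend, a hedged sketch of a 16-bit training step
(PyTorch assumed; the toy model and sizes are invented, and in practice batch-norm
statistics are often kept in FP32):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                          nn.Linear(1024, 10)).cuda().half()   # FP16 weights

    x = torch.randn(64, 1024).cuda().half()        # FP16 activations
    y = torch.randint(0, 10, (64,)).cuda()
    loss = F.cross_entropy(model(x), y)
    loss.backward()                                # FP16 gradients, half the memory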


Rajiv (https://scribie.com) says


2017-05-22 at 21:38 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15226)

AWS documentation mentions that the P2 instances can be optimized further,
specifically with persistence mode, disabling autoboost, and setting the clocks to max
frequency. Are there any such optimizations that can be done for GTX 1080 Ti cards?

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/accelerated-computing-
instances.html#optimize_gpu
(http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/accelerated-computing-
instances.html#optimize_gpu)


Tim Dettmers says


2017-05-26 at 13:21 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15350)


The performance difference of doing that compared to autoboost is usually not
that relevant (1-5%); it only matters if you really have large-scale, persistent operations
(a 1-5% saving in money can be huge for big companies). If you want to improve
performance, focus on cooling; that is where the real benefits lie. I do not know if
P2 instances have good cooling though.


Ashkan says
2017-05-22 at 23:07 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15228)

When I think of the several parameters associated with computational cost I get
confused, and I wonder if we can introduce comprehensive, illustrative metrics
which could be relied on, especially when selecting a GPU for a specific
task, e.g., deep learning. With all due respect, I believe your post is somehow leading
and somehow confusing! For instance, one might choose a 1060 6GB over a 980 8GB
due to the higher memory bandwidth. Is that really the correct decision? And the same
goes for other close-performance cards! I mean you might overlook some parameters
regarding how the program performs on the GPU and how the GPU executes the
code; I mean mostly the shading units and RAM size, at least in your comparisons.
For instance, how could we compare a graphics card with a memory bandwidth of 160GB/s,
1500 shading units and 8GB of memory with one with 200GB/s, 1250 units and 6GB of
RAM? Although I have searched everywhere on the net and read several papers, I
cannot answer it scientifically, or at least I cannot prove it. But one thing I claim is that
every percent increase in the number of shading units is more effective in reducing
computational time than the same percent increase in memory
bandwidth; for example, 10 percent higher bandwidth against 10 percent more
shading units. I think in some cases, if we only think of parallelism, then
80GB/s with 1000 units equals the case with 160GB/s and 500! I am not sure
about that. I would be glad to hear your opinions.


Tim Dettmers says


2017-05-26 at 13:31 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15352)

Shading units are usually not used in CUDA computations. Memory size also
does not affect performance (at least in most cases). It is difficult to give
recommendations for RAM size since every field and direction of work
has very different memory requirements. For example, computer vision
research on ImageNet, on GANs, and computer vision in industry all have
different requirements for memory size. All I can do is give a reasonable
direction and have people make their own choice. In terms of bandwidth you
can compare cards within each chip class (Tesla, Volta, etc.), but across classes it is
very difficult to give an estimate of 160GB/s vs 125GB/s etc.

In short, my goal is not to say: “Buy this GPU if you have these and these and
these requirements”, but more like: “Look, these things are important
considerations if you want to buy a GPU, and this is what makes a fast GPU. Make your
choice.”


Bro perfect says


2017-05-23 at 14:56 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15230)

Is it possible to combine the computational power of 6 machines with 8 GPUs each?


Is it possible for an algorithm to be able to use all 48 GPUs?

Thanks


Tim Dettmers says


2017-05-26 at 13:34 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15353)

Yes. It is a difficult problem, however you can tackle it with the right hardware
(network with 40 – 100 Gbit/s InfiniBand interfaces) and the right algorithms.

I recommend block momentum (https://www.microsoft.com/en-us/research/publication/scalable-training-deep-learning-machines-incremental-block-training-intra-block-parallel-optimization-blockwise-model-update-filtering/)
or 1-bit stochastic gradient descent (https://www.microsoft.com/en-us/research/publication/1-bit-stochastic-gradient-descent-and-application-to-data-parallel-distributed-training-of-speech-dnns/)
algorithms. Do not try asynchronous gradient methods — they are really bad!
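
To make the idea concrete, a NumPy sketch of the core mechanism behind 1-bit SGD (my
simplification, not Microsoft’s implementation): each gradient tensor is transmitted as one
sign bit per element plus a single scale, and the quantization error is fed back into the
next step so that no gradient information is lost over time.

    import numpy as np

    def one_bit_quantize(grad, error):
        g = grad + error                     # add the residual from the last step
        scale = np.abs(g).mean()             # one scalar per tensor to transmit
        q = np.where(g >= 0, scale, -scale)  # 1 bit per element, plus the scale
        return q, g - q                      # quantized gradient, new residual

    error = np.zeros(8)                      # persistent per-worker residual
    grad = np.random.randn(8)                # stand-in for a local gradient
    q, error = one_bit_quantize(grad, error) # send q over the network, keep error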


Aimee Vanallen says


2017-05-24 at 18:47 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15276)

This website is known as a stroll-by for all the information you wished about this
and didn’t know who to ask. Glimpse right here, and you’ll definitely uncover it.


weisheng says
2017-05-25 at 05:12 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15303)

Hi Tim,
Thanks for your sharing and I enjoy reading your posts.

Would you please comment on my 4x 1080 Ti build:
https://pcpartpicker.com/list/nWs7Fd (https://pcpartpicker.com/list/nWs7Fd)

Note that PCPartPicker is not allowing me to add 4 1080 Tis, so I put 4 Titan Xs as dummy
GPUs.

My spec
(“ref” means the time of the related Tim post,
“#” means my questions)
*1: main use: Kaggle competitions, trained in PyTorch

*2: GPU: 4x 1080 Ti
#can this build support 4 Volta GPUs in future?

*3: Motherboard: Asus X99-WS-E
<- confirmed quad PCIe 3.0 x16/x16/x16/x16, DDR4 ~128GB

*3: CPU: Xeon 2609 v4, 40 lanes PCIe 3.0
<- compared to the i7-5930K, 40% cheaper and 2x the DDR4 size
2GHz, why?
#shall I choose the 3.7GHz i7-5930K for 2x the clock speed?
(@ref_cpu_clk shows underclocking an i7-3820 to 1/3 causes a performance drop of ~8%
for MNIST and 4% for ImageNet)
<- CPU max DDR4 speed is 1866 MHz (RAM speed makes no difference, @ref2 2017-03-23 at 18:35)

*4: RAM: 64GB now, maybe 128GB in future

ref1: http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/
(http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/)
ref2: http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/
(http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/)
ref_cpu_clk: http://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?zoom=1.5&resize=500%2C337
(http://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?zoom=1.5&resize=500%2C337)


Tim Dettmers says


2017-05-26 at 13:38 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15354)

If you buy 4x GTX 1080 Ti and want to work on Kaggle competitions, I would
not skimp on the CPU and RAM. You will do a lot of other things besides deep
learning if you do Kaggle competitions.

Otherwise it is looking good. I think 4x GTX 1080 Ti is a bit of overkill for
Kaggle. You could also start with 2 GPUs, see how it goes, and add more later if
you want.


Weisheng says
2017-05-29 at 07:44 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-15447)

Hi Tim, thanks for your comment. I chose 128GB RAM and an i7 3.xGHz CPU
(5930K or 6850K) based on your CPU conclusion
(http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/):
two threads per GPU; > 2GHz.

FYI, the 4 GPUs are for 2 Kaggle participants.


vir das says


2017-05-26 at 23:54 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15371)

Hey. I am a creative coder and a designer. I am looking forward to experimenting with AI,
interaction design and images.
Projects similar to https://aiexperiments.withgoogle.com/autodraw
(https://aiexperiments.withgoogle.com/autodraw)

1) I am a beginner and looking for a laptop as it’s handy. A little low on budget, so I need to
decide.
The feasible option is a 960 4GB: http://amzn.to/2r5xujB (http://amzn.to/2r5xujB)
But if it doesn’t work at all I might consider the following 2 options:

GTX 1060 3GB — http://amzn.to/2qXCzwc (http://amzn.to/2qXCzwc)
GTX 1060 6GB — http://amzn.to/2rZlpMO (http://amzn.to/2rZlpMO)
Not able to decide which would be sufficient.

2) Do I need to learn machine learning and train neural networks by myself, or can I
just apply things from the open-source projects already available? Do you know good
resources for the same?


Tim Dettmers says


2017-05-27 at 11:29 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15392)

You can use pretrained models, which will limit you to certain themes, but you
do not need to learn to train these models yourself. A GTX 960 is also sufficient
for this (even a CPU would be okay). This might be enough for some interaction
design experimentation. If you want to go beyond that I would recommend a
GTX 1060 6GB. You will then need to learn how to train these models (which might
not be that difficult; it is basically changing parameters in open source code).
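
As a concrete example, a minimal sketch of using a pretrained model without any
training (PyTorch and torchvision assumed; the input tensor stands in for a
preprocessed image):

    import torch
    from torchvision import models

    net = models.resnet18(pretrained=True).eval()  # downloads ImageNet weights
    x = torch.randn(1, 3, 224, 224)                # stand-in for a preprocessed image
    with torch.no_grad():
        probs = net(x).softmax(dim=1)
    print(probs.argmax(dim=1))                     # predicted ImageNet class index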


Eric Perbos-Brinck says


2017-05-27 at 14:31 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15397)

Regarding your question #2, I highly recommend “Practical Deep Learning for
Coders” by Jeremy Howard (ex Kaggle Chief Scientist) and Rachel Thomas, in
partnership with the Data Institute of the University of San Francisco.


https://www.usfca.edu/data-institute/certificates/deep-learning-part-one
(https://www.usfca.edu/data-institute/certificates/deep-learning-part-one)

It’s a free MOOC with superb resources (Videos, Class Notes, Papers,
Assignments, Forums).


Spicer says
2017-05-28 at 01:52 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15411)

Hi Tim,

This is a great article and comments!

So… if you have only $1,500.00 today for your budget,
which components would you pick for a complete system,
and why?

Thanks in advance.


Tim Dettmers says


2017-05-29 at 14:07 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15460)

I would probably pick a cheap motherboard that supports 2 GPUs, DDR3
memory, a cheap CPU, 16GB of RAM, an SSD, a 3TB hard drive, and then probably
two GTX 1070s, or better if the budget allows.


Eric Perbos-Brinck says


2017-05-29 at 21:03 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15475)

@Spicer

Here’s a paper published today that may help you.

https://blog.slavv.com/the-1700-great-deep-learning-box-assembly-setup-
and-benchmarks-148c5ebe6415 (https://blog.slavv.com/the-1700-great-
deep-learning-box-assembly-setup-and-benchmarks-148c5ebe6415)

Eric


Bill Simmons (http://www.spoatech.com) says


2017-06-01 at 15:53 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15566)

Hi Tim,
Well, I am going all out to build a Deep Learning NN training platform. I plan to
spend the money on the following:
2 TitanP cards, with two remaining Gen3 PCIe slots available for expansion.
ASUS ROG RAMPAGE V EDITION 10 with 64GB DDR4-3200 RAM, an M.2 SSD, an i7
processor, and two ethernet ports – connecting 1 ethernet port to my internal
network for access, and the 2nd ethernet port connected directly to a data file
server for access to training data.
Can you tell me the weaknesses of such a rig and where I might be spending
money in the wrong places? Where could money be spent instead, to get an even
bigger bang for the buck?
Cheers,
Bill


Tim Dettmers says


2017-06-02 at 16:24 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15591)

Sounds like a reasonable setup if you want to use a data file server. If you want
to get more bang for the buck you can of course use DDR3 RAM and a slower
processor, but with this setup you will be quite future-proof for a while, so that
you can upgrade GPUs as they are released. Also, you are more flexible if you
want to do other work that needs a fast CPU. So everything seems fine.


brabo says
2017-06-03 at 12:26 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15622)

Which mini card would you recommend for an ITX platform?

Super Flower Golden Green HX 80 Plus Gold power supply – 350 Watt SF-350P14XE
(HX)
G776 Cooltek Coolcube Aluminium Silver Mini-ITX Case
Thank you very much! :D


Tim Dettmers says


2017-06-05 at 03:06 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15682)

I am only aware of 3 current cards which are suitable for such systems: the GTX
1050, GTX 1050 Ti, and GTX 1060 mini. These should fit into most mini-ITX
cases without any problem, but double-check that the dimensions are right.
While these cards are made to fit ITX boards, they may not fit every mini-ITX
case.


sami says
2017-06-03 at 12:51 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15623)

Hi. Thank you Tim for such a wonderful blog. I have questions regarding the Volta GPUs.
It is expected that the GeForce version of Volta will be launched anytime towards the
end of 2017 or early 2018. Do you think there will be so much difference in
performance between a 1080 Ti based rig (which I am thinking of getting for DL) and
Volta that I should wait for Volta?
My second question is: when Volta comes, will it need newer motherboards, or
can it be used at its full strength on the currently available motherboards
too? E.g., I am thinking of getting an EVGA X99 Classified mbo, so if I wish to get Voltas
in future, could they be installed on this mbo, or will the Voltas need a newer series of
boards? Thank you for your help.


Tim Dettmers says


2017-06-05 at 03:09 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15683)


Consumer Volta cards are what you want. These Volta cards will probably be
released a bit later, maybe Q1/Q2 2018, and will fit into any consumer-grade
motherboard — so an EVGA X99 will work just fine. They will be a good step up
from Pascal, with a similar jump in performance to what you see between Maxwell
and Pascal.


samihaq says
2017-06-05 at 07:47 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-15694)

But won’t Volta consumer cards use the NVLink interface, while current
motherboards like the EVGA X99 don’t support NVLink?
If current motherboards will only support Volta in a
backward-compatibility mode, i.e., without using its full potential,
then it would probably be better for me to get an EVGA 1080 Ti for now and
get two Volta consumer cards when they are introduced. Any
suggestions, please? Thank you very much.


Nikolaos Tsarmpopoulos says

2017-06-05 at 12:13 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15701)

Hi,

NVLink has just been introduced for the first time on the Quadro GV100 for the
desktop. The link adapter alone costs £1000 or so. It is intended for HPC and deep
learning applications, but it won’t be necessary for the consumer cards, since x8
PCIe lanes are still considered sufficient and the cards can still use x16 lanes in SLI
with a 40-lane CPU. Hence, it’s very unlikely NVLink will be included in consumer
cards anytime soon.


Nikolaos Tsarmpopoulos says


2017-06-05 at 12:48 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15702)

Hi,

NVLink has just been introduced for a professional-grade desktop card, the Quadro
GP100, and the link adapter alone costs £1000 or so. It is intended for HPC and
deep learning applications, but it won’t be necessary for the consumer cards, since
x8 PCIe lanes are still considered sufficient and the cards can still use x16 lanes,
even in a 2-way SLI configuration, with a 40-lane CPU. Hence, I think it’s very
unlikely NVLink will be included in consumer cards anytime soon.


David says
2017-06-05 at 20:12 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15716)

I’m currently using my CPU (a Xeon E5-2620 with 64GB of memory) to train large
convolutional networks for 3D medical images, and it is rather slow. The training
(using Keras/TensorFlow) takes up 30-60 GB of memory, so I don’t think the network
could train on a single graphics card. Would buying/adding multiple graphics cards
net my system enough GPU memory to train a 3D CNN?


Tim Dettmers says


2017-06-07 at 19:04 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15787)

I would try to adjust your network rather than your hardware. Medical images
are usually large, so make sure you slice them up into smaller images (and pool the
results for each image to get a classification). You can also utilize 1×1
convolutions and pooling to reduce your memory footprint further. I would not
try to run on such data with CPUs as it will take too long to train. Multiple
GPUs used in model parallelism for convolution are not too difficult to implement
and would probably work if you use efficient parallelization frameworks such as
CNTK or PyTorch, but it is still quite cumbersome.
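
A hedged sketch of the slicing idea (PyTorch assumed; the patch size and tensor shapes
are invented): run the network over fixed-size sub-volumes and pool the per-patch
outputs, so only one patch has to fit into GPU memory at a time.

    import torch

    def predict_volume(net, volume, patch=64, stride=64):
        # volume: (1, channels, D, H, W) tensor, too big for one forward pass;
        # net is assumed to already live on the GPU.
        scores = []
        for z in range(0, volume.size(2) - patch + 1, stride):
            for y in range(0, volume.size(3) - patch + 1, stride):
                for x in range(0, volume.size(4) - patch + 1, stride):
                    sub = volume[:, :, z:z+patch, y:y+patch, x:x+patch].cuda()
                    with torch.no_grad():
                        scores.append(net(sub).cpu())
        return torch.stack(scores).mean(0)   # pool the patch predictions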


Chanhyuk Jung says


2017-06-07 at 15:02 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15783)

I am trying to make an autonomous drone navigation system for a project. I want to
have two cameras with two motors each (so they could turn like the eye), two
accelerometers (because humans have two) and two gyroscopes (same reason as
before) as inputs for the neural net, and to output to the four motors of the drone. I’m
trying to apply deep learning to make the drone autonomous. But I’ve only worked
on GPUs with at least 100GB/s bandwidth. Since the computer needs GPIO to control
the motors and receive input in real time, I went for a single-board computer. I
couldn’t find boards with great GPU performance except for the Jetson modules.
What single-board computer or SoC would you recommend?


Tim Dettmers says


2017-06-07 at 19:08 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-15788)


I think Jetson is almost the only way to go here. Other modules are too big to work
on drones. However, another option would be to send images to an external
computer, process them there, and send the results back. This is quite difficult
and will probably have a latency of at least 300ms, and probably closer to
750ms. So I would go with a Jetson module (or the new isolated GPUs, which are
basically a Jetson without any other parts). You can also interface a Jetson with
an Arduino, and as such you should have all that you need for motor control.


Tim says
2017-08-08 at 22:57 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-18742)

Movidius has some chips that might be useful for you, although they are quite
specialized for visual deep learning.


hamza says
2017-06-08 at 07:11 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15824)


Hi,
Thank you Tim for such helpful suggestions. I am interested in a 4-GPU setup (EVGA
with its custom ICX cooling technology) for DL. Can anyone, based on their
experience, recommend a reliable motherboard for holding 4 GPUs
comfortably with ventilation space? It is very confusing, as there are many boards
available, but some have little/no space for 4 GPUs and others have reliability issues,
as posted on Newegg. Secondly, my question is: will a 1300W PSU be enough to
support 4 GPUs? Thank you very much.


Tim Dettmers says


2017-06-15 at 19:38 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16077)

Yes, the motherboard question is tricky. The cooling does not make a huge
difference though; it is mostly about cooling on the GPUs, not around them. The
environment where the GPU is standing and the speed of the fans are much
more important than the case for cooling. So I would opt for a reliable
motherboard even if there is not too much ventilation space.


Tim Dettmers says


2017-06-15 at 19:47 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16080)


Oh, and I forgot: a 1300W PSU should be sufficient for 4 GPUs. If your CPU
draws a lot (> 250W) you want to up the wattage a bit.


hamza says
2017-06-17 at 06:09 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-16142)

Hi Tim. Thank you for your suggestion. Can you please look at these two
Newegg links and, based on your extensive experience, make a suggestion
for the mbo? My criteria are durability and being future-proof. I want to install 2
1080 Tis for now and 2 Volta consumer GPUs when they come next
year (each card will take 2 slots).
First is an Asus E-WS mbo having 7 PCIe sockets, price $514:
https://www.newegg.com/Product/Product.aspx?Item=N82E16813182968&ignorebbr=1&cm_re=asus_e-ws_x99_motherboard-_-13-182-968-_-Product
and second is an EVGA having 5 PCIe sockets, price $300:
https://www.newegg.com/Product/Product.aspx?Item=N82E16813188163&nm_mc=AFC-C8Junction&cm_mmc=AFC-C8Junction-PCPartPicker,%20LLC-_-na-_-na-_-na&cm_sp=&AID=10446076&PID=3938566&SID=
Thank you very much.


Tim Dettmers says


2017-06-20 at 00:25 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-16230)

I have no experience with these motherboards. The Asus one seems
better, as it appears more reliable from the reviews. However, if you
factor in the price I would go with the EVGA one. It will probably work
out just fine and is much better from a cost-efficiency perspective.


Matthew says

2017-06-10 at 11:30 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-15921)

Hi,

I’m hoping someone could help recommend which GPU I should buy. My budget is
limited at the moment, and I’m just starting out with deep learning. So I have
narrowed down my options to either the GTX 1050 Ti or the 3GB version of the
1060. I’m mainly interested, at least to start with, in things like pix2pix and
CycleGAN. So I’m unsure if the extra 1GB of memory on the 1050 Ti would be better,
or the extra compute power of the 1060. The 6GB version of the 1060 is a bit
beyond my budget at the moment.

Thanks.


Tim Dettmers says


2017-06-15 at 19:42 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16078)

Tough question. For pix2pix and GANs in general more memory is often better.
However, this is also dependent on your data. I think you could do some
interesting and fun stuff with 3GB already, and if networks do not fit into
memory you can always use some memory tricks. If you use 16-bit floating
point models then you should be quite fine with 3GB, so I would opt for the
GTX 1060.
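
One example of such a memory trick is gradient checkpointing (my illustration, not
necessarily what Tim has in mind): recompute activations during the backward pass
instead of storing them, trading compute for memory. A sketch assuming PyTorch’s
torch.utils.checkpoint:

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    block1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
    block2 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())

    x = torch.randn(8, 3, 256, 256, requires_grad=True)
    h = checkpoint(block1, x)          # block1 activations are not stored
    loss = checkpoint(block2, h).sum()
    loss.backward()                    # activations get recomputed here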


wang says
2017-06-12 at 08:21 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15985)

Hey Tim, perfect article! Thank you very much!


I have a question: how about the Tesla M2090, compared to the 1060?


Tim Dettmers says


2017-06-15 at 19:44 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16079)

A Tesla M2090 would be much slower than a GTX 1060 and has no real
advantage. It costs more. So definitely choose a GTX 1060 over a Tesla M2090.


M fazaeli says

2017-06-12 at 15:18 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-15993)

Hi,
After we choose a GPU like the 1080 Ti, how do we assemble a good box for it? There is
a bunch of motherboards that are gaming-specific and not designed for days of computing.
Choosing the best model is also not a good option because they cost much more than the
GPU itself. Having a 1000W+ PSU and an open-loop cooler costs as much as an upgrade
from a 1070 to a 1080 Ti.

Do these cards need XMP RAM at 3000MHz, or is 2133 enough?

I think the story gets more complex after choosing the card, and I ask you to light
up the way to selecting system parts, not for overclocking the GPU but just to turn it on. This
guide will change the budget available for the GPU.


Tim Dettmers says


2017-06-15 at 19:49 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16081)

Gaming motherboards are usually more than enough. I do not think
compute-specific motherboards have a real advantage. I would choose a gaming
motherboard which has good reviews (which often means it is reliable). I would
try to go with less fancy options to keep the price down. You do not really need
DDR4, and a high clock for RAM is also not needed for deep learning. You
might want those features if you work on more data science related stuff, but
not for deep learning.


Alper (http://alpslabel.wordpress.com) says


2017-06-12 at 19:23 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15998)

Hi Tim, thanks for such a thorough exploration. But I have some other questions in
mind, related to the CPU side. Normally, with a single card (Titan X Maxwell, as well as
a 1080 Ti), one CPU core stays on top, and the other three fluctuate in the 10-50% range
(P8Z77, i5-3570K, 32GB, under DIGITS). Now I am planning to make a change, with
an X99-E WS board. The only thing I could not decide is: if DL apps can use a single
core only, should we look at the fastest single-core CPUs? At the time of writing, the
fastest single core is on the i7-7700K 4.2GHz CPU. I am planning to buy a Xeon 2683 v3
processor, which appears faster on CPU benchmarks, but slower when it comes to
single-core performance.
Due to this fast-single-core subject, I cannot decide. Since my aim is to go for 4
GPUs, should I go for the 4-core one, or the 14-core one? I have used a dual Titan X setup
for a while and saw the CPU percentage rise to 280%, compared to 170-180%
with a single GPU. From this observation, CPU performance appears important to a degree,
to my eye. Any opinions?


Tim Dettmers says

2017-06-15 at 19:52 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16082)

The CPU speed will make little difference, see my analysis in my other blog post
about hardware (http://timdettmers.com/2015/03/09/deep-learning-
hardware-guide/). Often a high percentage of CPU means actively waiting for
the GPU to do its job. Running 4 or more nets on 2 GPUs on a 4-core CPU
never caused any problems for me; the CPU never saturated (and this is with
background loader threads). So no worries on the CPU size, almost any CPU
will be sufficient.
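
A sketch of what such background loader threads look like in practice (PyTorch’s
DataLoader assumed; the dataset here is a dummy): a couple of worker processes per
training run are usually enough to keep the GPU fed, which is why a modest CPU
rarely becomes the bottleneck.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    data = TensorDataset(torch.randn(10000, 3, 32, 32),
                         torch.randint(0, 10, (10000,)))
    loader = DataLoader(data, batch_size=128, shuffle=True,
                        num_workers=2,     # background workers load batches
                        pin_memory=True)   # faster host-to-GPU copies

    for images, labels in loader:
        images = images.cuda(non_blocking=True)  # overlaps copy with compute
        # ... forward/backward pass here ...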


Jay says
2017-06-13 at 04:06 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16006)

Where do you stand on NVIDIA Founders Edition vs custom GPUs; is there a
preference for one over the other in a DL setting?


Tim Dettmers says


2017-06-15 at 19:53 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16083)

They run the same chip, so they are essentially the same. It is a bit like
overclocked GPUs vs normal ones; for deep learning it makes almost no
difference.


Mario says
2017-06-13 at 16:07 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16015)

Hi Tim

First off, kudos, like everybody else, for making our life immensely easier. I
apologise in advance for any inaccuracies, this is all rather new to me.
I am currently considering getting my first rig. I have settled on a 1080 Ti as a good
compromise of RAM, budget, performance and future-proofness.
From what I sense, the next logical upgrade would be to add another 1080 Ti
whenever I max it out, and continue using 16 lanes per card.

Having 3 cards on the rig would offer less of an improvement for training times,
because of a limit of x8 lanes, increased overhead in coordinating parallelisation,
increased cooling requirements, and a larger PSU.
Hence my envisaged upgrade route would be, if and when required:
1) Add a second 1080 Ti.
2) Add a second rig (this will help with hyperparameter tuning; I suspect it won’t help
much with parallel training on the SAME model)

Am I wrong in envisaging it this way? I.e. would it still be more cost-efficient to try
and push a third and fourth card onto the same rig? In which case I have to pay more
attention to the motherboard selection. I want to have 64GB of RAM on the
motherboard itself, so this is already pushing me into the “right” territory with
regard to choosing a motherboard that supports 3 or 4 GPUs.

Inputs welcome – and thanks again.


MR


Tim Dettmers says


2017-06-15 at 19:56 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16084)

Do not worry too much about the problems with lanes, cooling etc if you add a
third GPU. The slow-down will be noticeable (10-20%) but you still get the
most bang for the buck. You are right that adding another server does not help
with training a single model (it gets very complicated and inefficient), and it
would be kind of wasted as a parameter machine. I would go and stuff one
machine as much as possible and if you need more computing power after that,
then buy a new machine.


smb says
2017-06-13 at 22:32 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16022)


Hey,

My question is (again) about PCIe 2 performance. But most questions are about
new mainboards, while I am thinking about buying a used Xeon workstation with this
chipset: http://ark.intel.com/products/36783 (http://ark.intel.com/products/36783),
in a 2x PCIe x16 configuration. There is already an old Nvidia Quadro 4000 card in
there and I want to add a 1080 8GB.
Is there anything that wouldn’t work in this setup?


Tim Dettmers says


2017-06-15 at 19:59 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16085)

It would work just fine. PCIe guarantees backward compatibility, so the GTX
1080 should work just fine. The CPU is a bit small, but it should run deep
learning models just fine. For other CPU-intensive work it will be slow, but for
deep learning work you will probably get the best performance for the lowest
price. One thing to notice is that you cannot parallelize across a Quadro 4000
and a GTX 1080, but you will be able to run separate models just fine.
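
A hedged sketch of how to run separate models on the two cards: pin each training
process to one GPU with the standard CUDA_VISIBLE_DEVICES environment variable
before the framework initializes CUDA (PyTorch assumed for the check).

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # use '1' in the second process

    import torch
    print(torch.cuda.get_device_name(0))       # the only device this process sees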


Yuan says
2017-06-14 at 17:25 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16047)

Thank you for the excellent post. I have a few questions about PCI slots.
1) I found most pre-built workstations have only one PCIe x16 and some PCIe x4
and x1. Can a GTX 1080 Ti work well with x4 or x1?
2) Do you have recommendations for pre-built systems with two PCIe x16 slots? (I
prefer not to build one from scratch, but okay with simple installation like adding
GPUs and RAMs.)


Tim Dettmers says


2017-06-15 at 20:01 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16086)

I have seen some mining rigs which use x1, but I do not think they support full
CUDA capabilities with that, or you need some special hardware to interface
these things. It probably also breaks standard deep learning software. The best
option is just to stick to x16 slots.


Haris Jabbar says


2017-06-15 at 13:34 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16068)


Hello Tim

I stumbled across this news item about a pair of cryptocurrency-specific GPU SKUs
by NVIDIA. (http://wccftech.com/nvidia-pascal-gpu-cryptocurrency-mining-price-
specs-performance-detailed/ (http://wccftech.com/nvidia-pascal-gpu-
cryptocurrency-mining-price-specs-performance-detailed/))

With a price tag of $350 for a GTX 1080-class card, do you think it’s a good buy for
deep learning? The only downside is no video ports, but that doesn’t matter for DL
anyway.

Thanks

Haris


Tim Dettmers says


2017-06-15 at 20:04 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16087)

They will be a bit slower, but more power efficient. If you use your GPUs a lot,
then this will make a much more cost-efficient card over time. If you plan to use
your card for deep learning 24/7 it would be a very solid choice and probably
the best choice when looking at overall cost efficiency.


Haris Jabbar says


2017-06-17 at 18:34 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-16157)

Thank you for the quick reply!

Could you say why they would be slower? Because I want to weigh the
speed vs the power savings.


Eric Perbos-Brinck says


2017-06-17 at 19:37 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-16160)

I’m always amazed by claims such as:

“Buy two of these cards (at $350 each) and place them in this system and you
are looking at making over $5,000 a year in extra income! via
LegitReviews”

If that were true, the opportunity cost for Nvidia or AMD would be insane: why sell them
in the first place, and not keep them to mine cryptocurrency themselves in
giant farms for their shareholders’ immediate benefit?

If I were a shareholder, I’d be furious at the management for such a lousy
decision.


Tim Dettmers says


2017-06-20 at 00:20 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-16229)

This happened in the past: suddenly difficulty stagnated while
hardware improved, and the GPUs were worthless. The last time, this
happened from one week to the next: mining hardware worth $15,000
was worth $500 one week later. So if you look at it from a long-term
perspective, going into cryptocurrency mining would not be a good
strategy for NVIDIA or AMD; it is just too risky and unstable.


Jan Kohut says


2017-06-16 at 18:37 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16122)

Hi, thanks for the article. I want to buy a GTX 1060. Do you think there is any risk in
buying a factory-overclocked GPU? Can I be sure that the factory-overclocked card
will be precise in computations? I already read in the comments that the speed
difference isn’t noticeable, but the prices in my country aren’t much different, so for
me it’s basically the same to buy normal or overclocked… And also I would like to ask
if there is any difference between 8GHz and 9GHz GDDR5? The cards I’m
considering are:

Overclocked:
MSI GeForce GTX 1060 GAMING X+ 6G
MSI GeForce GTX 1060 GAMING X 6G
GIGABYTE GeForce GTX 1060 G1 Gaming
ASUS GTX 1060 DUAL OC 6GB

Not overclocked:
ASUS GeForce GTX 1060 Turbo 6GB
ASUS GTX 1060 DUAL 6GB

Thanks for your opinion


Tim Dettmers says


2017-06-16 at 20:15 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16129)

The factory overclocks are usually within specific bounds which are still safe to use.
So it should be fine. As you say, the core clock speed makes no big difference; however,
the memory clock makes a big difference. So if the prices are similar I would go
for a 9GHz GDDR5 model.


deeplearn9 says
2017-06-18 at 06:33 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16180)

Hi Tim,

What do you think of the following config for computer vision deep learning and
scientific computing (MATLAB, Mathematica)?

Intel – Core i7-6850K 3.6GHz 6-Core Processor
Asus – SABERTOOTH X99 ATX LGA2011-3 Motherboard
Corsair – Vengeance LPX 64GB (4 x 16GB) DDR4-2800 Memory
Samsung – 960 Pro 1.0TB M.2-2280 Solid State Drive
EVGA – GeForce GTX 1080 Ti 11GB SC2 HYBRID GAMING Video Card
SeaSonic – PRIME Titanium 850W 80+ Titanium Certified Fully-Modular ATX
Power Supply

I’m oversizing the PSU because I might want to add another 1080 Ti in the future.
Some questions I have:
1) What kind of cooling do you recommend for the CPU?

2) Do you hook your machine to an Uninterruptible Power Supply/Surge Protector?


ravi jagan says


2017-06-20 at 00:17 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16227)

There may be a problem with the Samsung disks with Linux. I have a Dell
Alienware Aurora R6 and had difficulty installing Linux. I saw the forums and no
one has been able to install on the R6 – apparently the Samsung shows up as 0 bytes or
something. I am using TensorFlow on Windows 10 on the same hardware. Works
fine, the GPU is getting utilized, but I don’t know about performance.


Tim Dettmers says


2017-06-22 at 05:43 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-16283)

This is a good point, Ravi. A quick Google search shows that some people
have problems with PCIe SSDs under Linux. However, there are already
some solutions popping up, so it seems that fixes and instructions on how to
get it working are underway. So I guess it should work in some way, but it
may require more fiddling before it works.


Tim Dettmers says


2017-06-20 at 00:17 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16228)

Looks good. Strong CPU and RAM components are useful for scientific
computing in general and I think it is okay to invest a bit more into these for
your case.
1) For CPU cooling a normal air cooler will be sufficient; often CPUs run rather
cool (60-70 C) and do not need more than air cooling.
2) Not really needed; I use a power strip with a surge protector and that should
protect your PC from most things which can cause a power surge.


Vick says
2017-06-19 at 04:29 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16205)

Hello, nice guide, based on which I bought a 1050 Ti. But going through the installation of
drivers/CUDA etc., I came across this page: https://developer.nvidia.com/cuda-gpus
(https://developer.nvidia.com/cuda-gpus)
Here, the 1050 Ti is not listed as a supported GPU. Am I stuck with a useless card?
When I try to install the CUDA 8 Toolkit, it always gives me an “Nvidia installer
failed” error. Cannot get past it. Also, do I need Visual Studio installed? I am trying to
install as per the instructions here: http://docs.nvidia.com/cuda/cuda-installation-guide-
microsoft-windows/index.html#axzz4kPWtoq7o
(http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-
windows/index.html#axzz4kPWtoq7o)

I need Tensorflow to work with my GPU, that is all. Any advice? Thanks.


Tim Dettmers says


2017-06-20 at 00:11 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16226)

The GTX 1050 Ti supports CUDA fine; the problem is that you probably need
the right compiler to make everything work. So yes, if you are missing the right
Visual Studio then this is the missing ingredient.


Vick says
2017-06-22 at 07:43 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-16286)

Thanks Tim.

Am I supposed to install the full Visual Studio compiler, or will the VS 15
Express edition do?
The Nvidia CUDA 8.0 toolkit will not install; it always exits with an error,
“Nvidia installer failed” — and the VS 15 Express exited with an error
(“some part did not install as expected”). Looks like this has the potential to
turn into a nightmare. Any ideas, anyone? Thanks. I am running Win7 Pro,
SP1. Trying to get this EVGA GPU working for TF.


Ahmed Kalil says


2017-06-22 at 23:46 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16317)

Hi Tim,
can you comment on this build: https://pcpartpicker.com/list/nThsQV
(https://pcpartpicker.com/list/nThsQV)

Thank You


Tim Dettmers says


2017-06-25 at 00:54 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16375)

This is a pretty high-cost rig. For deep learning performance, it will not
necessarily be better than a cheaper rig, but you will be able to do lots of other
stuff with it, like data science, Kaggle competitions and so forth. Make sure the
motherboard supports Broadwell-E out of the box.


Ahmed Kalil says


2017-06-25 at 06:43 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-16385)

Thank you very much Tim, I really appreciate your support.


Jay Karimi says


2017-06-24 at 20:31 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16368)

I am in the “getting started in deep learning but serious about it” category; I have
completed Andrej Karpathy’s and Andrew Ng’s courses. I am primarily interested in
computer vision (Kaggle) and reinforcement learning (OpenAI Gym). I am looking to
build a deep learning PC; here is my parts list:
https://pcpartpicker.com/list/n7KZm8 (https://pcpartpicker.com/list/n7KZm8)

Should I keep the GTX 1070, or should I spend the extra $250 to get the GTX 1080 Ti?
Will my current CPU be able to support the GTX 1080 Ti comfortably?


Tim Dettmers says


2017-06-25 at 00:58 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16377)

Looks like a very solid, high-quality low-price rig. I think the GTX 1070 will be
enough for now. You will be able to run most models, and those that you cannot
run you could run in 16-bit precision. If you have the extra money though, the
GTX 1080 Ti will give you a very solid card which you will probably not need to
upgrade even after Volta hits the market (although selling a GTX 1070 and
getting a GTX 1170 would not cost much either). So both choices are good. The
GTX 1080 Ti is not necessary but would add comfort (no precision fiddling) and
you will be good with it for at least a year or two.


Nikos Tsarmpopoulos says


2017-06-27 at 02:33 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16464)

Hi Tim,
The new AMD Vega Frontier Edition card comes with 25 TFLOPS of FP16 compute and 480 GB/s memory throughput. AMD pushes the card primarily for graphics workstations and AI (???). Is there a framework that would scale in a multi-GPU environment and supports AMD’s technology (OpenCL, I presume)?
Thank you


Tim Dettmers says


2017-07-05 at 00:49 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-16820)

Currently, I am not aware of libraries with good AMD support, but that might
change quickly. The specs and price are pretty good and there seems to be
more and more effort put into deep learning on AMD cards. It might be
competitive with NVIDIA by the end of the year.


Greg says
2017-07-19 at 01:55 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-17271)

Hi Tim,
I want to first thank you for all your awesome posts. You definitely rock!

I have what I think is a quick question for you.

In regards to deep learning and GPU processing, I have one Titan X. What’s your opinion on keeping the Titan X and adding three more, versus selling the Titan X and going with four of something more modern and affordable like the 1070? What do you think, considering function and price?

Thanks in advance for your advice.


Tim Dettmers says


2017-07-19 at 05:47 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-17275)

I think this depends on what you want to use your cards for. If your current models saturate your Titan X, then it might make sense to stick with more Titan Xs. However, if your memory consumption per model is usually much lower than that, it makes sense to get GTX 1070s. This should be a good indicator of what kind of card would be best for you. Also consider keeping the GTX Titan X and buying additional GTX 1070s. You will not be able to parallelize across all 4 GPUs, but this might be a bit cheaper option.


Greg says
2017-07-19 at 07:50 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-17279)


I’m thinking future too… and the GTX 1070 may need to be upgraded to fill the shoes of the Titan X; however, I hear the GTX 1080 Ti might be a good alternative to the Titan X across the performance board.
Thoughts on this, Tim?

Greg


Tim Dettmers says


2017-07-19 at 22:31 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-17313)

If you can wait until Q1/Q2 2018 I would stick with any of the
mentioned GPUs and then upgrade to a Volta GPU in Q1/Q2 2018.
The problem with GTX 1080 Ti is that it will lose quite some value
once the Volta GPUs hit the market. You could get GTX 1080 Ti and
sell them before Volta comes out, but I think upgrading to Volta
directly might be smarter (if you can afford the wait).


Anas Koara says



2017-07-24 at 23:06 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-17643)

Hi, it’s my first deep learning project and the only GPU I could find is
http://www.gpuzoo.com/GPU-AFOX/GeForce_GT630_-_AF630-1024D3L1.html
(http://www.gpuzoo.com/GPU-AFOX/GeForce_GT630_-_AF630-1024D3L1.html)

What do you think about it: is it suitable for neural machine translation? My data is about 4 GB and 1 million sentences.


Tim Dettmers says


2017-07-26 at 06:02 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-17734)

It is a bit tight. You might be better off using a CPU and a library with good CPU
support. I heard Facebook has quite good CPU libraries. They might be
integrated into PyTorch, but I am not sure.


Robin Colclough (http://www.viewpoint-3d.com) says


2017-07-26 at 12:07 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-17744)


Actually, AMD have already successfully ported over 99% of the NVidia deep
learning code to their Instinct GPUs, so compatibility will not be a problem.

As Facebook have recently open-sourced their CAFFE2 deep learning code, that
will also be available when using AMD GPUs.

Without competition, AI will not push ahead, nor will price drop enough to spread AI
use. As such, AMD’s massive investment in AI and deep learning is crucial to our
societies.

“AMD took the Caffe framework with 55,000 lines of optimized CUDA code and
applied their HIP tooling. 99.6% of the 55,000 lines of code was translated
automatically. The remaining code took a week to complete by a single developer.
Once ported, the HIP code performed as well as the original CUDA version.”
https://instinct.radeon.com/en-us/the-potential-disruptiveness-of-amds-open-
source-deep-learning-strategy/ (https://instinct.radeon.com/en-us/the-potential-
disruptiveness-of-amds-open-source-deep-learning-strategy/)

“Today Facebook open sourced Caffe2. The deep learning framework follows in the
steps of the original Caffe, a project started at the University of California, Berkeley.
Caffe2 offers developers greater flexibility for building high-performance products that deploy efficiently.”
https://techcrunch.com/2017/04/18/facebook-open-sources-caffe2-its-flexible-deep-learning-framework-of-choice/
(https://techcrunch.com/2017/04/18/facebook-open-sources-caffe2-its-flexible-deep-learning-framework-of-choice/)

Have you considered using the new breakthrough AMD technology for AI, from the
Ryzen CPU’s aimed at parallel processing, to the new AMD Instinct series of GPUs?

These will offer superior power and facilities to NVidia and at least 30% lower cost,
and have been designed to be ready for future AI computing needs, being much
more scalable than NVidia technology.

Launched June 2017: “AMD’s Radeon Instinct MI25 GPU Accelerator Crushes
Deep Learning Tasks With 24.6 TFLOPS FP16 Compute”

“Considering that EPYC server processor have up to 128 PCIe lanes available, AMD
is claiming that the platform will be able to link up with Radeon Instinct GPUs with
full bandwidth without the need to resort to PCI Express switches (which is a big
plus). As we reported in March, AMD opines that an EPYC server linked up with
four Radeon Instinct MI25 GPU accelerators has roughly the same computing
power as the human brain. ”

Read more at:

https://hothardware.com/news/amd-radeon-instinct-mi25-gpu-accelerator-deep-learning-246-tflops-fp16 (https://hothardware.com/news/amd-radeon-instinct-mi25-gpu-accelerator-deep-learning-246-tflops-fp16)

https://instinct.radeon.com/en-us/product/mi/ (https://instinct.radeon.com/en-
us/product/mi/)

https://instinct.radeon.com/en-us/the-potential-disruptiveness-of-amds-open-
source-deep-learning-strategy/ (https://instinct.radeon.com/en-us/the-potential-
disruptiveness-of-amds-open-source-deep-learning-strategy/)




Tim Dettmers says


2017-07-26 at 19:00 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-17759)

Indeed, this is a big step forward. The hardware is there, but the software and
community is not behind it fully yet. I think once PyTorch has full AMD support
we will see a shift. I think AMD is getting more and more competitive and soon
it will be a great alternative option if not the better option over NVIDIA GPUs.


Eric Perbos-Brinck (http://www.fast.ai) says


2017-07-26 at 19:22 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-17761)

I’m a fanboy of AMD GPU’s + FreeSync monitor combo for gaming, insane
value vs Nvidia.

But when it comes to choosing a GPU for your Deep Learning personal station **TODAY**, there’s no possible hesitation: Nvidia with CUDA/cuDNN all the way, from the GTX 1060 6GB to the 1080Ti/Titan Xp.


Building a Deep Learning stable rig is complex enough for beginners (dealing
with Win vs Linux, Python 2.7 vs 3.6, Theano vs TensorFlow and so on), no need
to add a layer of cutting-edge tuning with AMD “work-in-progress”.

Now, the moment the AMD eco-system is truly operational and crowd-tested, I’ll be the first one to drop Intel/Nvidia to return to AMD with a Ryzen 1800x/Vega 10.


Robin Colclough (http://www.viewpoint-3d.com) says


2017-07-26 at 19:38 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-17763)

No need to wait till the end of the year; OpenMI is not just an option for PyTorch, but PyTorch is now available for AMD GPUs, as you can read below.

OpenMI offers the benefits of source code availability, meaning users can fine-tune the code to best fit their needs, and also improve the code base.

For those starting AI / deep-learning projects, AMD offers a fully functional alternative to NVidia with a 30-60% cost saving. Not forgetting that AMD have many AI projects using their technology.

PyTorch :-
“For PyTorch, we’re seriously looking into AMD’s MIOpen/ROCm software stack to
enable users who want to use AMD GPUs.


We have ports of PyTorch ready and we’re already running and testing full
networks (with some kinks that’ll be resolved). I’ll give an update when things are in
good shape.
Thanks to AMD for doing ports of cutorch and cunn to ROCm to make our work
easier.

[–]JustFinishedBSG

I am very, very interested. I’m pretty worried by Nvidia’s utter unchecked domination in ML.
I’m eager to see your benchmarks; if it’s competitive in PyTorch I’ll definitely build an AMD workstation”
Source:
https://www.reddit.com/r/MachineLearning/comments/6kv3rs/mopen_10_released_by_amd_deep_learning_software/
(https://www.reddit.com/r/MachineLearning/comments/6kv3rs/mopen_10_released_by_amd_deep_learning_software/)

AMD AI/Deep learning products now available:


https://instinct.radeon.com/en-us/category/products/
(https://instinct.radeon.com/en-us/category/products/)


Tim Dettmers says


2017-07-27 at 01:33 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-17774)


The software is getting there, but it has to be battle-tested first. It will take some time until the rough edges are smoothed. Also, a community is very important; currently the OpenMI community is far too small to have any impact. The same is true for PyTorch when compared to TensorFlow, but the current trends indicate that very soon this may change. However, in the end, the overall picture counts, and as of now I cannot recommend AMD GPUs since everything is still in its infancy. I might update my blog post soon though, to indicate that AMD GPUs are now a competitive option if one is willing to handle the rough edges and live with less community support.


Robin Colclough (http://www.viewpoint-3d.com) says


2017-07-27 at 12:35 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-17788)

I don’t think you should be so fast to dismiss AMD’s solutions; Caffe is available and tested, and so are many applications using OpenMI.

AMD is also offering free hardware for testing for interesting projects, but
regardless of that, researchers should start enquiring now, to see if what is on offer
meets or even exceeds their needs.



nikola rahman says


2017-07-29 at 13:37 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-17887)

Hi Tim.

I currently have the following configuration:


GPU: GTX 1080
CPU: i5-6400 CPU @ 2.70GHz (Max # of PCI Express Lanes = 16)
Motherboard: H110M PRO-D (1 PCIex16)
RAM: 32GB
PSU: max. power 750W

I want to upgrade this machine with at least one more GPU for the moment and
gradually add more GPUs over the next couple of months. I want this machine to be
used by 4-5 people. I like this solution better than the one where everyone gets a
PC because one person can use more than one GPU if it’s available, and it’s cheaper
(probably). Can you recommend a motherboard and a CPU for this purpose and
maybe give your comment on this approach since I don’t have any experience?

Thanks!


Tim Dettmers says


2017-08-03 at 03:09 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-18254)


There are no motherboards which are exceptional for 4 GPUs. They are all expensive and have their problems. The best bet is to search on pcpartpicker.com for 4-way SLI motherboards and select one with a good rating and a good price. Optionally, you can look in this comment section (or the comment section of my hardware blog post (http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/)) for motherboards that other people picked. The CPU option is difficult to recommend. For deep learning, you will not need a big CPU, but depending on what each user does (preprocessing etc.) one might need a CPU with more cores. I would go with a fast 4-core, or with 6+ cores.


Jeff (http://www.aivital.com) says


2017-07-31 at 16:06 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-17981)

Hi Tim,

Thanks for all the good info.

A lot of the specs for the new Volta GPU cards that will be coming out seem to focus on gameplay. Maybe I missed it, but have you seen any kind of breakdown comparison with other GPUs like the Titan Xp or the 1050 Ti? I am just wondering if it would be a good idea to wait for the Volta or just dive in for a couple of Xp’s now. What are your thoughts?

Thanks,
Jeff


Tim Dettmers says


2017-08-03 at 03:05 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-18253)

It is difficult to say. The rumor mills point to consumer Volta around 2017 Q3, but Ti and Titan cards in 2018 Q1/Q2. It is unclear if the cards also come with the TensorCores that the V100 has. If not, then it is not worth waiting; if they have those, then a wait might be worth it. In any case, you would wait for quite some time, so if you need GPU power now, I would go with a Titan Xp or GTX 1080 Ti (it is easier to sell those to gamers once Volta hits the market).


Massinissa says

2017-08-11 at 10:34 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-18891)

Hi Tim,

Thank you for this valuable information.

I am planning to build a 5x GTX 1080 rig using an ASRock Z87 Extreme4 motherboard. I was wondering if it is such a good idea to use USB PCIe x1-to-x16 risers to plug in all the GPUs and save some money, instead of buying a motherboard with multiple x16 slots.

Do you think this will have a big impact on my deep learning processes?

Thanks in advance


Tim Dettmers says


2017-09-01 at 16:17 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-20461)

I am not sure if that will work. This is usually done for cryptomining but I have
never seen a successful setup for deep learning. You could try this and see if it
works or not and let us know. I am curious.


Nikolaos Tsarmpopoulos says



2017-08-14 at 02:29 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-19103)

Hi Tim,

Nvidia announced that they will unlock the Titan Xp performance. Does this help with deep learning? It’s currently unclear to me what exactly they have unlocked; is it FP64 and/or FP16?

Thanks and regards


Krishna says
2017-08-16 at 20:30 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-19313)

Say, so if I get two 1080 8GB GPUs for training models, do I need a 3rd GPU to run 2 monitors? I was trying to figure out if I needed to actually get three 1080s, or if I can get away with just two 1080s, and then one somewhat smaller GPU just to run my monitors?


Tim Dettmers says


2017-09-01 at 16:15 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-20460)


You just need the two GPUs. You can run deep learning models and monitors on them at the same time. I do it all the time with no problems!


Jamil says
2017-08-30 at 21:55 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-20363)

Awesome guide Tim!

As many others are probably wondering, what’s your take on the new RX Vega
cards? And specifically their FP16 performance for deep learning as opposed to an
Nvidia 1080ti for example?

For the price of a 1080ti you can almost buy two Vega 56. For someone planning to
use them mostly for reinforcement learning on an AMD Threadripper platform,
would you recommend I get two 1080tis or four Vega 56’s?

Thanks!


Tim Dettmers says


2017-09-01 at 16:03 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-20455)


The hardware is there, but not yet the software. Currently, the software and
community behind AMD cards are too weak to be a reasonable option. It will
probably become more mainstream in the next 6 months or so. So you can live
with trouble, be a pioneer in the AMD land and thus help many others with
contributing bug reports and problems, or you can make it easy and go with
NVIDIA — your choice.


Andres (http://solardynamo.org) says


2017-08-30 at 23:58 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-20366)

Hi Tim,

I love your post and it seems to be one of the go-to resources for making a choice. I wanted to ask you a question. I am doing an application where I need a LOT of memory, and by a LOT I mean I was thinking about getting the Quadro P6000, which has 24GB of VRAM (the most of any Nvidia card).

I know people tend to go either GeForce or Tesla for machine learning, but I was wondering if you knew of the Quadros being used for this purpose and if you have any suggestions for me regarding this matter. If I go for a Tesla P100 I would “only” have 16GB.

Thanks a lot!


Tim Dettmers says


2017-09-01 at 16:00 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-20453)

Quadro cards are okay. The thing is, you don’t want to run expensive cards if
you can do it with less. Have you looked into 16-bit networks and did you
simplify your network and memory requirements to a sufficient point? I think
you can keep the memory in check with a little bit of engineering (or precise
GPU -> CPU and CPU -> GPU transfers).
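As a rough illustration of the 16-bit idea, here is a minimal PyTorch sketch; the ResNet model and batch size are illustrative assumptions, not something from the comment above:

import torch
import torchvision.models as models

# Convert an illustrative model and its inputs to 16-bit floats.
model = models.resnet50().cuda().half()
x = torch.randn(8, 3, 224, 224).cuda().half()
out = model(x)

# Weights and activations now take roughly half the memory of the
# 32-bit version; numerically fragile layers (e.g. batch norm) may
# need to be kept in 32-bit.

The same idea applies in other frameworks; the point is simply that halving the precision roughly halves the memory footprint.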


Danita Halmick says


2017-09-04 at 22:58 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-20618)

This website is known as a stroll-by for all of the information you wanted about this
and didn’t know who to ask. Glimpse right here, and you’ll positively uncover it.


navid ziaei says


2017-09-07 at 20:55 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-20802)

Hi,
I am planning to build the following system for deep learning:
Mother Board: MSI X99A SLI PLUS
GPU: MSI GeForce GTX 1080 ( x 2)
CPU: Intel® Core™ i7-6800K Processor
RAM: DDR4 32GB (16GB x 2) 2400MHz
Case: Deepcool Tesseract Sw
Power: Green GP1050B

I wonder what is the best cooling system? Does this system with dual GPUs really need water cooling? In that case, is this case suitable? Is there something like CPU liquid-cooling blocks for GPUs?


Tim Dettmers says

2017-09-08 at 20:44 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-20859)

Air is sufficient for 2 GPUs. For 3 and 4 GPUs air can also be reasonable, but
here water cooling makes more sense and will improve performance
considerably. For 2 GPUs I do not think it is worth it, because you will gain
almost no performance.


Spring says
2017-09-09 at 21:04 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-20909)

Hello Tim

I am going to purchase a machine for deep learning in computer vision.


Could you let me know what you think about the following configuration?

One (1) E5-2623 v4 2.60GHz four-core 85W processor
32GB DDR4 ECC RAM (4x 8GB)
One (1) NVidia GeForce GTX Titan XP
One (1) 512GB SSD
One (1) 2TB HDD


Tim Dettmers says


2017-09-10 at 21:02 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-20947)


Looks good. I do not think you really need the ECC option for your RAM, unless it is the same price as a non-ECC option, of course. Otherwise, these would be solid main parts.


Mahmood says
2017-09-10 at 08:43 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-20926)

Hi Tim,

Thank you for this valuable information.

I am planning to start playing with deep learning. Is my laptop with a GTX 1050 4GB sufficient?
My laptop con guration: https://support.hp.com/hk-en/document/c05370398
(https://support.hp.com/hk-en/document/c05370398)

Thanks in advance



Tim Dettmers says


2017-09-10 at 21:00 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-20946)

Yes, this will be sufficient to peek a bit into deep learning. With that you will be able to test deep learning on small datasets and you will be able to use deep learning methods in many Kaggle competitions.


Mykhailo Matviiv says


2017-09-13 at 16:34 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21058)

Hi, Tim!
Thanks for such a helpful article!

I’m in the ‘I started deep learning and I am serious about it’ category now.
I own a GTX 960 4GB and want to upgrade to a 1080 Ti. But I’ve heard that Volta GPUs will hit the market soon, so I’m confused now.
Would you recommend buying a 1080 Ti now or just waiting for a Volta GPU?



Tim Dettmers says


2017-09-13 at 21:52 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-21070)

We will see the Volta Titan probably around February next year. The 1180 and 1170 would probably hit the market earlier. I would just get the GTX 1080 Ti, skip the early Voltas and then settle on a Volta Titan. That way you do not have to wait with your GTX 960, which might slow you down for a few months.


Mykhailo Matviiv says


2017-09-13 at 22:42 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-21072)

Thanks for the fast advice! Btw, is there any info about when the Volta specs will be revealed?


Nero Wang says


2017-09-14 at 17:59 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21107)


Hi, Tim, if I want to run 2 GPUs, is it necessary to run in PCIe x16 for each card, or can they also run in PCIe x8 each?


Nero Wang says


2017-09-14 at 18:04 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-21108)

OK, I found the answer.


http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14149 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-14149)


Tim Dettmers says


2017-09-16 at 21:57 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-21171)

Thanks for finding the answer yourself and also linking to it! If more people would do this, the comment section would be much easier to read.



Ravi Kumar says


2017-09-17 at 14:11 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21185)

Hi Tim! Thank you very much for a detailed article on the GPUs. I am looking at GTX
1080 Ti for deep learning in computer vision. Based on your advice to one of the
questions above, I am treating all the brands equally. Now I have to decide on the
cooling. From this article – http://thepcenthusiast.com/geforce-gtx-1080-ti-
compared-asus-evga-gigabyte-msi-zotac/ (http://thepcenthusiast.com/geforce-gtx-
1080-ti-compared-asus-evga-gigabyte-msi-zotac/), looks like there are 3 cards with
inbuilt liquid cooling. One of them is air + liquid and the other 2 are blower + liquid.
So how do these hybrid ones work? Would both cooling options work in tandem, or should we select one of them to work at a time? If both work in tandem, would the blower create the same noise as in the blower-only cards?


Tim Dettmers says


2017-09-26 at 10:44 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-21522)

Air + liquid and blower + liquid just refer to different designs here. The main concept of liquid cooling is that liquids, such as water, have much higher thermal conductivity, and thus heat can “flow” much quicker into the liquid rather than the air, and thus it flows away faster from the processors. However, you still need to cool down the water, and you do this by using cooling fins which increase the surface area, plus some form of air cooling. Air here refers to a fan (or blower if you like) that is directly attached to the GPU, while a blower is an external air fan.

External air fans have the advantage that you can have more intricate cooling fin structures which will dissipate the heat quicker. Also, the fan can be larger. With this, you can cool the water and thus the processor more efficiently, but it will also take up more space.

I do not believe that there is a big difference between these designs; as long as you have any liquid cooling you should see a big decrease in GPU temperatures. So both options should be fine.


Wei Zhang says


2017-09-20 at 03:12 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21270)

What a great post!



Jalal Abdulbaqi says


2017-09-20 at 22:19 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21295)

Hi,
thank you very much for this valuable information. I just have one comment about the comparison. You keep the comparison within the GeForce series only; what about Quadro? I think there are many interesting models that can be used for either performance or price efficiency.


Tim Dettmers says


2017-09-26 at 10:47 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-21523)

Quadro performance is about the same as GeForce performance, but Quadros are significantly more expensive. I do not recommend Quadro or Tesla GPUs; they are too expensive for their performance. If you are forced to buy a Quadro, look up the GPU chip type and compare with the GeForce series to get a performance comparison (also pay attention to the bandwidth; this is sometimes different between Quadro and GeForce cards that use the same chip).



Peter says
2017-09-22 at 16:26 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21358)

Hello Tim, I have an old MacBook Pro with an NVIDIA GeForce GT 330M GPU. Will that be good enough to get me through the first few months of learning DL?


Tim Dettmers says


2017-09-26 at 10:50 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-21524)

Unfortunately, it will not be enough. The compute capability is too low, meaning that you cannot run most deep learning software, which requires cuDNN. There are some exceptions for software libraries which would still allow you to run some models, but it would be limiting and it would also be quite slow. You can get started on a CPU, although it will take some while until models are fully trained, but it gives you a little feel of what is going on. If that is for you, you might want to invest in a small, cheap desktop with a GTX 1050 Ti or GTX 1060, or buy an external Thunderbolt GPU for your MacBook.
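For anyone unsure about their own card, the compute capability and cuDNN status can be checked in a few lines; a minimal sketch, assuming a CUDA-enabled PyTorch install:

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("compute capability: %d.%d" % (major, minor))
    # cuDNN requires compute capability 3.0 or higher;
    # the GT 330M (compute capability 1.2) is below that.
    print("cuDNN version:", torch.backends.cudnn.version())
else:
    print("No usable CUDA GPU found.")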



Joe says
2017-09-26 at 14:20 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21541)

I came across Intel’s Movidius Stick (see https://developer.movidius.com/ (https://developer.movidius.com/)) and was looking for a comparison of how it performs compared to GPUs. This post is the nicest comparison; I wonder whether you could add a note or even do another post.

Cheers!


Tim Dettmers says


2017-09-29 at 13:50 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-21668)

The problem with new hardware is software. I do not see the point of this stick
if it does not support CUDA, and since it is from Intel, it will probably not
support it. Nice gimmick, but probably better to get some other hardware.



new_dl_learner says
2017-09-27 at 11:35 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21569)

Hi Tim, my MacBook Pro has an NVIDIA GeForce GT 330M graphics processor with 512MB of GDDR3 memory. Is this sufficient to learn about deep learning?


new_dl_learner says
2017-09-29 at 03:38 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21651)

Hello, I came across an article about the difference in performance between PCIe 3.0 x8 vs. x16:

https://www.gamersnexus.net/guides/2488-pci-e-3-x8-vs-x16-performance-impact-on-gpus (https://www.gamersnexus.net/guides/2488-pci-e-3-x8-vs-x16-performance-impact-on-gpus)

They seem to have concluded that “The difference is not even close to perceptible and should be ignored as inconsequential to users fretting over potential slot or lane limitations.”

I guess they talked about frame rate performance. Does the conclusion also apply to deep learning with 1080Tis running at x16x16x16x16 vs. x8x8x8x8 vs. x16x8x16x8? If so, I would just get a CPU that supports 2-3 x16 lanes.



Shahab says
2017-10-01 at 14:31 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21764)

Hi Tim,

First of all, thank you so much for your fantastic article; it’s very informative and
unique.
I’d like to ask a question and would be grateful if you can help me somehow.
I plan to buy a GTX 1080 Ti for an object detection and transfer learning task using the pre-trained model “rfcn_resnet101_coco”. As I understand it, adding this GPU will hugely speed up model training/tuning, which is now very time-consuming on my CPU (Core i7 3770k). In addition to the training process, at the moment, when the tuned model wants to detect the object of interest in an image, the CPU loads up to 100 percent, yet the detection process is very slow (maybe taking a minute to finish).
My question is: does this hardware upgrade also boost the prediction/inference execution time?

Thank you



Tim Dettmers says


2017-10-01 at 15:28 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-21766)

Yes, inference will be much faster too. If you only run one image at a time for prediction, the GPU cannot be utilized fully, and thus 100 images one-by-one is much slower than a single batch of 100 images, but you should still see a huge increase of 5x to 10x if you do it one-by-one when compared with the CPU.
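To see the batching effect concretely, here is a minimal timing sketch, assuming a recent PyTorch; the ResNet model stands in for whatever detector is actually used:

import time
import torch
import torchvision.models as models

model = models.resnet101().cuda().eval()
images = torch.randn(100, 3, 224, 224).cuda()

with torch.no_grad():
    start = time.time()
    for i in range(100):       # 100 one-by-one predictions
        model(images[i:i+1])
    torch.cuda.synchronize()   # wait for the GPU to finish
    print("one-by-one: %.3fs" % (time.time() - start))

    start = time.time()
    model(images)              # one batch of 100 images
    torch.cuda.synchronize()
    print("batched: %.3fs" % (time.time() - start))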


Shahab says
2017-10-02 at 09:18 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-21788)

Hi Tim,
Thank you so much for your valuable help.


William Simmons (http://www.spoatech.com) says


2017-10-04 at 20:52 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-21916)


Hi Tim,
I’ve been building a system that has an X299 motherboard with 128GB RAM and an i9 processor. I bought two Titan Xp GPUs to use for deep learning network training. In my lab, I wound up with an Nvidia GeForce 1080Ti card without a home. The motherboard has an additional PCIe slot equivalent to the two slots occupied by the Titan cards. Would I be better off adding this card to the two Titan Xp cards already there for some additional oomph, or would there be issues with cards which are different, driver-wise or otherwise? I am using the Ubuntu 16.04 LTS operating system.
Cheers,
Bill


Tim Dettmers says


2017-10-05 at 09:50 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-21951)

Hi Bill, you should have no problems with pairing Titan Xp and GTX 1080Ti
cards. The only problem is that you cannot parallelize the system across all
three cards, only over the two GTX Titan Xps.


William Simmons (http://www.spoatech.com) says


2017-10-05 at 22:03 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-21983)


So, there would really be no advantage to the third card then, right?
Cheers,
Bill


Tim Dettmers says


2017-10-11 at 10:06 (http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/#comment-22318)

You can still run single-GPU jobs on the third card, which is still quite useful!
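One way to pin a job to that card is the CUDA_VISIBLE_DEVICES environment variable, which CUDA-based frameworks respect; a sketch, assuming the GTX 1080 Ti enumerates as device 2:

import os

# Expose only device 2 (here assumed to be the GTX 1080 Ti) to this
# process; it then appears as device 0. Set this before CUDA initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch
print(torch.cuda.device_count())      # -> 1
print(torch.cuda.get_device_name(0))  # -> the GTX 1080 Ti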


Aerylia says
2017-10-07 at 15:35 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-22113)

Hey!

Thanks for the great overview post.


Could you give me some advice about some problems I am having with my 660 Ti?
I am an AI student and have to train some neural networks for assignments every once in a while, but I have never been able to install the well-known Python libraries in a way that they would work: Caffe, Theano, Keras and TensorFlow.

The best indication I got was when Theano gave me an (actually informative) warning, stating that it saw my 660 Ti but could not use the cuDNN optimizer. As I understand from the internet, my GPU should have support for it, but apparently it doesn’t work.

Since I have swapped around all of the (hand-me-down) components in my computer, except for the GPU, in the last few years, I expected that something was wrong with the GPU and was looking for a cheap but functional new GPU, like the 1060 6GB that your post advises me to get.

As a student, a new GPU is quite the investment, and reading the comments, I thought: why not ask? So do you have any ideas about why I cannot get any neural network libraries to run on my GPU? They do work on the CPU, and other libraries for the GPU do work, like OpenMP.

Thanks!


Abhinav Mathur says


2017-10-09 at 22:53 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-22233)


Hi,

Can you suggest some of the gaming desktops available at Costco?


Ex: https://www.costco.com/ASUS-ROG-G20CB-Gaming-Tower—Intel-Core-i7—
8GB-NVIDIA-Graphics.product.100347734.html (https://www.costco.com/ASUS-
ROG-G20CB-Gaming-Tower---Intel-Core-i7---8GB-NVIDIA-
Graphics.product.100347734.html).
https://www.costco.com/CyberpowerPC-SLC3200C-Desktop—7th-Generation-
Intel-Core-i7—8GB-NVIDIA-GeForce-GTX-1080-
Graphics.product.100333319.html (https://www.costco.com/CyberpowerPC-
SLC3200C-Desktop---7th-Generation-Intel-Core-i7---8GB-NVIDIA-GeForce-
GTX-1080-Graphics.product.100333319.html)
https://www.costco.com/CyberpowerPC-SLC3600C-Desktop—Intel-Core-i7—
11GB-NVIDIA-GeForce-GTX-1080Ti-Graphics.product.100350563.html
(https://www.costco.com/CyberpowerPC-SLC3600C-Desktop---Intel-Core-i7--
-11GB-NVIDIA-GeForce-GTX-1080Ti-Graphics.product.100350563.html)

Do you think they make sense for someone with a limited budget trying Kaggle competitions?


John O says

2017-10-12 at 03:35 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-22348)

Dear Tim,

Thank you for your superb article in helping me get started on deep learning. I am planning on getting the following configuration, and would like to seek your feedback:
– Asrock Z270 Killer SLI/ac ATX Motherboard
– Zotac GTX 1070 Mini / Zotac GTX 1080 Mini
– Intel Core i5-7400
– Kingston 16GB DDR4 2133MHZ
– Kingston SSDNOW UV400 480GB
– Cooler Master G750M
– Fractal Design De ne R5 Blackout Edition ATX Chassis

Are there any problems in using ASRock & Zotac? I can’t seem to find many reviews on them for deep learning. ASRock seems to pack more features for the buck, and I can extend it with more GPUs in the future, while the Zotac Mini is cheaper than other GTX 1080 brands with a more compact design.

https://www.asrock.com/mb/Intel/Z270%20Killer%20SLIac/index.us.asp
(https://www.asrock.com/mb/Intel/Z270%20Killer%20SLIac/index.us.asp)
https://www.zotac.com/my/product/graphics_card/GeForce-GTX-1080/all
(https://www.zotac.com/my/product/graphics_card/GeForce-GTX-1080/all)

What case would you recommend for a quieter build, but without getting into heat problems (potentially with 2 GPUs in the future)?

Thank you for your time.



Nade3r says
2017-10-12 at 17:58 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-22392)

What is a recommended GPU for a startup company with datasets of more than 10GB in size, to apply deep learning and ML models?
Thank you


Tim Dettmers says


2017-10-19 at 17:56 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-22718)

A GTX 1070 or GTX 1080 is often good for a startup. However, due to bitcoin-mania, the prices on those cards can be high. It might be better to go with a used GTX Titan X (Maxwell or Pascal) or buy a new Pascal card if you find some which are competitively priced.


Kenneth Hesse says


2017-10-17 at 06:02 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-22588)


Hi Tim,

Thanks for the post.

I have been studying machine learning theory for the past three months and I’m
itching to start experimentation. I’ll start with feed forward networks, but I am most
interested in sequence learning using recurrent networks. I want to experiment
with single and double layered networks using LSTM cells. I want to mess around
with bidirectional network architectures as well. But, from reading UCSD’s critical
review I’m left with the impression that researchers mostly use Titan XPs for these
types of things (Lipton, Berkowitz, Elkan [2015]). If so I’ll focus my experimentation
on feed forwards for now and have to wait for grad school to mess around with
sequence-capable architectures. Do you know if the GeForce 1080 will be sufficient
for training recurrent networks?

Here is my build so far:


Case: Corsair 540 Mid-Tower
Motherboard: MSI X99A SLI PLUS ATX
Memory: Kingston Corsair Vengeance ,
??? 8×2 gb or 2×16 gb
GPU: ??? GeForce 1080 ( 1 to start, increase to up to 3 later)
CPU: intel i7
PSU: ??? 1300 watt or 1500 watt

I appreciate any advice you can give me.

Ken


Tim Dettmers says


2017-10-24 at 15:50 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-23008)

Yes, the GTX 1080 will be more than sufficient for these architectures. If you have just stacked LSTMs, almost all datasets should work with a GTX 1080. If you have more complicated models you might run short, but you can always try to make your network architecture more efficient. A GTX 1080 will definitely get you very far, and for most researchers, that is all they need. For my research, a GTX 1080 was always sufficient.

The build looks good. I think 1300 watts will be enough. The RAM is personal preference, so both options are fine. Looks solid otherwise.


huichen says
2017-10-17 at 16:48 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-22635)

Hi Tim,
Comparing the parameters of the GTX 780 and GTX 1050 Ti, including the bandwidth, bus width, CUDA cores, and TFLOPS, the GTX 780 is always better than the GTX 1050 Ti. And they have almost the same price! But in your opinion, the GTX 780 < GTX 1050 Ti in DL?
Thanks!


Tim Dettmers says


2017-10-24 at 15:46 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-23006)

Good catch! The problem is that you cannot compare bandwidth, bus width, and CUDA cores and say “X is better than Y”. The architecture of the chip determines how efficiently these things can be used, and the GTX 10 architecture is much more efficient than the GTX 7 architecture. However, I have no hard data, and it might still be that GTX 780 > GTX 1050 Ti, but unless somebody has some deep learning benchmarks on, say, ResNet performance, I would still assume that GTX 1050 Ti > GTX 780.
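Such a benchmark is easy to run on both cards; a minimal sketch, assuming a recent PyTorch and torchvision (the batch size is an illustrative choice; reduce it if memory runs short):

import time
import torch
import torchvision.models as models

model = models.resnet50().cuda()
criterion = torch.nn.CrossEntropyLoss()
x = torch.randn(8, 3, 224, 224).cuda()
y = torch.randint(0, 1000, (8,)).long().cuda()

for _ in range(5):             # warm-up iterations
    model.zero_grad()
    criterion(model(x), y).backward()
torch.cuda.synchronize()

start = time.time()
for _ in range(20):            # timed forward + backward passes
    model.zero_grad()
    criterion(model(x), y).backward()
torch.cuda.synchronize()
print("%.1f ms/iteration" % ((time.time() - start) / 20 * 1000))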


Maximilian says
2017-10-23 at 13:24 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-22923)

Hello,

Thank you very much for the tutorial!


I am sorry if I have missed the point but I am a skint student haha.

So I have three options for a GPU under £200: a GTX 1060 6GB for ~£200, a 980 Ti 6GB for a similar price (but way better spec), or a 780 Ti 2GB for ~£60.

I am planning on running a facial detection model on the GPU.

What are the positives and negatives of these GPUs? Why wouldn’t I get the 780 Ti, which has the best CUDA score (https://browser.geekbench.com/cuda-benchmarks (https://browser.geekbench.com/cuda-benchmarks)) for the price? Do I need more than 2GB (the model is unlikely to be anywhere near that big?), and what is wrong with it being so old?

Any advice would be greatly appreciated.

Thank you


Tim Dettmers says


2017-10-25 at 12:46 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-23053)

2 GB can be a bit short on the memory side. There are some sophisticated
facial detection models and they might need more RAM than that. I would go
for a GTX 980Ti if I were you. If you just run inference on pretrained models,
2GB should be enough though, and then a GTX 780Ti would be a good choice.


new_dl_learner says
2017-10-24 at 15:56 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23009)

Hello Tim, how do the mobile versions of the Nvidia 1040, 1050, 1060 with/without Ti perform? Are they not as good as the desktop versions? I am considering getting a new laptop, such as a Surface Book 2 or a Lenovo. Thanks


Marc-Philippe Huget says


2017-10-24 at 17:43 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23015)

Dear Tim,

Have you checked post and people working on developing Ethereum mining rig?
They consider up to 8-12 GPU (or pretend it is possible) in one unique machine.
This could be effective for multi-GPU simulations. Any thoughts about that?


Regards,
mph


Marian says
2017-10-24 at 22:27 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23026)

Thanks for the great article on GPU selection. Any chance you could offer updated
advice on system building? About a year ago, I built a system with 4 1080s based on
your recommendations. Now I am interested in building an 8 GPU system, but it
seems this will require some rather specialized hardware, and I am not sure what to
buy or if it is even practical for an individual or small company to build this kind of
machine. Also, I am curious about the new AMD processors that may have more
PCI lanes and whether they would be preferred over Xeon now. Also, it would be
great to see an article about getting visibility into PCI bus usage and whether it is a
bottleneck or not. This is something I have often wondered about.


Tim Dettmers says

2017-10-25 at 12:00 (http://timdettmers.com/2017/04/09/which-gpu-for-


deep-learning/#comment-23050)

Thanks for your feedback. Indeed these are some issues which have been

raised again and again and I should write about it to clarify these points.

To give you a few short answers:


– I do not recommend 8 GPU systems; for optimal performance, they require
special code which no framework supports. Some frameworks should work
with less optimal performance algorithms without a problem though (less
performance would be about 7.9x speedup vs 6.5x speedup for convnets for
example). Another problem with such a system is the power supply (multiple
PSUs usually) and cooling (often special cooling solutions are needed). My
recommendation is to go with 4 GPU systems which are independent instead
of 8 GPU systems.
– AMD processors do not have any big advantage over Intel CPUs; the extra
lanes make almost no difference. Pay attention to cooling solutions first. If you
have a liquid cooling solution and you parallelize most of your workloads, only
then it may make sense to look into more lanes. Usually, parallel algorithms
have a much larger effect than lanes. Good algorithm + few lanes > bad
algorithm + maximum lanes.
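For the common case, plain data parallelism across the 4 GPUs of such a system is a one-liner in, for example, PyTorch; a minimal sketch with an illustrative model:

import torch
import torchvision.models as models

model = models.resnet50().cuda()
# Replicate the model on GPUs 0-3; each forward pass splits the input
# batch across the replicas and gathers the outputs on GPU 0.
model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3])

x = torch.randn(64, 3, 224, 224).cuda()
out = model(x)   # each GPU processes 16 of the 64 images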


John A (https://laptop-review.co.uk/) says


2017-10-27 at 07:00 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23170)

Thanks for the deep information.


Tim Dettmers says


2017-11-20 at 18:46 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-24150)

You are welcome!


Iago Burduli says


2017-10-29 at 14:18 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23275)

Will a 2×1080 Ti setup be bottlenecked by the 28 PCIe lanes of the Intel Core i7-7820X, given that the cards will run in a 16+8 lane scheme?


Tim Dettmers says

2017-11-20 at 18:48 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-24152)

It's not a bottleneck, but if you run parallel algorithms, you will see a decrease in performance of 0-10% compared to a 16/16 lane setup.
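
To see what a given slot actually delivers, here is a rough sketch (assuming PyTorch with CUDA) that times a pinned host-to-device copy; on PCIe 3.0 you should see very roughly 12 GB/s at x16 and about half that at x8:

    # Time a pinned 256 MB host-to-device copy to estimate PCIe bandwidth.
    import time
    import torch

    x = torch.empty(256, 1024, 1024, dtype=torch.uint8).pin_memory()  # 256 MB
    torch.cuda.synchronize()
    t0 = time.time()
    x.cuda(non_blocking=True)
    torch.cuda.synchronize()  # wait until the copy has actually finished
    gb = x.numel() / 1024**3  # uint8 = one byte per element
    print("~%.1f GB/s host-to-device" % (gb / (time.time() - t0)))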


Robin Colclough (http://www.viewpoint-3d.com) says


2017-11-20 at 20:53 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-24154)

You can upgrade to 64 PCIe lanes by using the AMD Threadripper 1900X, which now retails for US$449 on Amazon.
Combine this with AMD Radeon Instinct AI cards and the CUDA cross-compilation tooling, and the hardware cost of AI per unit of performance drops by over 50%.


Arturo Pablo Rocha Ramirez says


2017-10-30 at 04:59 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23289)

Which card should I choose to start in deep learning: an old/new GTX 960/970 or a new GTX 1050 Ti? The prices of the three cards are very similar. Or could I choose an older generation like the 770 or 780?
Thanks for this good post!


Ankit Dhingra (http://KPIT.com) says


2017-11-01 at 09:48 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23399)

Hi, thanks for this well-written article, it is really helpful.

I am a beginner in deep learning and am going to start building deep learning models soon; the data is small rather than big.
I have a 16-core Xeon 2.4 GHz CPU and 32 GB of RAM.
It comes preloaded with an Nvidia Quadro K620 (2GB) GPU card.

Will this small GPU work for small data and small deep networks during our development phase, or do we have to buy a new GPU right now? By next month we expect a new machine with a better GPU.

Thanks in advance,
Ankit Dhingra


Paco says
2017-11-03 at 21:30 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23487)

Hello Tim, really useful stuff on your site, congratulations. I have been reading that the Tesla V100 has 900 GB/s memory bandwidth, but when I go to AWS to read the P3 specs, it says EBS bandwidth is 1.5-14 Gbps. Is that the difference in performance due to virtualization that you were talking about, or is this another metric? Many thanks


sharath says
2017-11-04 at 13:32 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23516)

Hi Tim,

I want to purchase a good GPU specifically for natural language processing computations. Could you suggest a good GPU from both the Nvidia and AMD lineups that can handle a good amount of NLP tasks?

Also, which API (OpenCL, OpenGL, Vulkan) would be best for NLP computations on AMD GPUs, for both Windows and Linux?


ArtVandelay says
2017-11-05 at 15:38 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23563)


Hello Tim,

Thanks for this write-up; it's been very helpful.

You say that for some of the leading computer vision models a Titan Xp would be recommended, and also for computer vision with datasets larger than 250 GB (or something along those lines).

So I'm asking with regard to DeepMind and their PySC2 research, which is aimed at image-based machine learning agents. Would a Titan Xp be better here, or would a 1080 Ti suffice?

Also, I imagine you know of Matthew Fisher and his blog. He's done some very interesting things with AI and computer vision.

I am interested in doing what he did with SC2: intercept the Direct3D 9 API to allow an AI to interact with the game. For something like this, would a 1080 Ti be okay, or would a Titan Xp make a huge difference?

Thank you


Scott says

2017-11-07 at 14:09 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-23657)

Tim, what about the latest Amazon AWS EC2 P3 instances based on the Volta V100?
Is their price/performance competitive?


deepuser says
2017-11-09 at 21:29 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23771)

Hi Tim, excellent article! Thanks very much for the detailed analysis!

We are looking to build a machine for running deep learning algorithms for computer vision on relatively large datasets (many terabytes). We are deciding between a machine with 4 Titan Xps and a machine with, say, one P100. Some background: we envision this machine being used not for experimentation or for training different models on different GPUs; it will be used to train “a” model and do inference. So, for us, 4 GPUs are useful if we can use multi-GPU training/inference (data/model parallelization) and if it gives us a significant performance improvement. The other option is a single high-performance GPU like the P100, but we certainly don't require double-precision accuracy. This being the case, do you have any suggestions on either of the two options or something else altogether? Thanks again for your advice!
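
For what it is worth, the data-parallel option is cheap to prototype before committing to hardware. A minimal sketch (assuming PyTorch) that splits each batch across all visible GPUs with nn.DataParallel, enough to gauge the speedup on a given model:

    # Data parallelism: one model replicated on each GPU, batches split among them.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10))
    model = nn.DataParallel(model).cuda()   # replicates the model on every GPU
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(512, 4096).cuda()       # this batch is split across the GPUs
    y = torch.randint(0, 10, (512,)).cuda()
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()                         # gradients are reduced onto GPU 0
    opt.step()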



Jack Marck says


2017-11-13 at 20:32 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-23894)

What are your thoughts on the recent availability of Volta cards via AWS? By my estimate, it would take about 245 hours of AWS On-Demand usage to break even with a 1080 Ti.


Jack Marck says


2017-11-20 at 21:24 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-24156)

Hi Tim,

I'm interested in deep reinforcement learning for robotics. Since the training episodes for this type of work are often run in real time, is there any tangible benefit to training on a beefy GPU?


Haider says
2017-12-01 at 01:52 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-24523)


Hi Tim,

I have one 1080 Ti and want to buy another one.

I am thinking of adding a third GPU, a 1060 6GB, so that when I want to use the PC for coding or other purposes while training neural nets on the two 1080 Ti GPUs, it will not be sluggish. And perhaps I would use this third GPU for training as well when I am not using the PC, so I can train another architecture in the meantime.

I am new to deep learning, but when I used my only 1080 Ti for 3ds Max rendering with V-Ray RT (GPU rendering), the PC became annoyingly slow.

The question is how many lanes my CPU supports.

The Core i7-3770K spec says up to either 1×16 or 2×8 or 1×8 & 2×4; it is not clear what the maximum number of lanes is, but it seems to be 16.

The motherboard GA-Z77X-D3H specs say (under "Expansion slots"):


1 x PCI Express x16 slot, running at x16 (PCIEX16)
* For optimum performance, if only one PCI Express graphics card is to be installed,
be sure to install it in the PCIEX16 slot.

1 x PCI Express x16 slot, running at x8 (PCIEX8)


* The PCIEX8 slot shares bandwidth with the PCIEX16 slot. When the PCIEX8 slot
is populated, the PCIEX16 slot will operate at up to x8 mode.

1 x PCI Express x16 slot, running at x4 (PCIEX4)


* The PCIEX4 slot shares bandwidth with the PCIEX1_1/2/3 slots. The PCIEX1_1/2/3 slots will become unavailable when a PCIe x4 expansion card is installed.

3 x PCI Express x1 slots


(The PCIEX4 and PCIEX1 slots conform to PCI Express 2.0 standard.)

This is a bit confusing. Now, can I connect the two GTX 1080 Ti GPUs with PCIe 3.0 running at x8 each, and the GTX 1060 with PCIe 2.0 running at x4? Or is the maximum 16 lanes, so there is no room for the third GPU?

Perhaps the Motherboard Block Diagram is more enlightening at page 8 here:


http://download.gigabyte.eu/FileList/Manual/mb-manual_ga-z77x-d3h_v1.1_e.pdf
(http://download.gigabyte.eu/FileList/Manual/mb-manual_ga-z77x-d3h_v1.1_e.pdf)

It seems the CPU indeed has a maximum of 16 PCIe 3.0 lanes, but the extra 4 lanes are not coming from the CPU's PCI Express bus; they come from the Intel Z77 chipset, which has its own PCIe 2.0 bus running at x4.

What do you think?

Many thanks!
References:
https://ark.intel.com/products/65523/Intel-Core-i7-3770K-Processor-8M-Cache-
up-to-3_90-GHz (https://ark.intel.com/products/65523/Intel-Core-i7-3770K-
Processor-8M-Cache-up-to-3_90-GHz)

https://www.gigabyte.com/Motherboard/GA-Z77X-D3H-rev-11#sp
(https://www.gigabyte.com/Motherboard/GA-Z77X-D3H-rev-11#sp)

https://www.gigabyte.com/Motherboard/GA-Z77X-D3H-rev-11#support-manual
(https://www.gigabyte.com/Motherboard/GA-Z77X-D3H-rev-11#support-
manual)


http://download.gigabyte.eu/FileList/Manual/mb-manual_ga-z77x-d3h_v1.1_e.pdf
(http://download.gigabyte.eu/FileList/Manual/mb-manual_ga-z77x-d3h_v1.1_e.pdf)


EricPB says
2017-12-08 at 22:50 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-24876)

Hi Tim,

NVidia just made a surprise announcement yesterday: they are releasing, for
immediate purchase, a Titan V priced at $3000 with specs almost identical to the
Tesla V100 PCIe ($10,000).

https://www.anandtech.com/show/12135/nvidia-announces-nvidia-titan-v-video-
card-gv100-for-3000-dollars (https://www.anandtech.com/show/12135/nvidia-
announces-nvidia-titan-v-video-card-gv100-for-3000-dollars)

For $3000, you can get either four GTX 1080 Tis ($750 apiece) or a single Titan V.

Which option would you go for, for deep learning?

Cheers,

E.


Amir H. Jadidinejad (http://amirj.github.io) says


2017-12-11 at 21:09 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-25046)

Would you please review the new Titan V in the field of deep learning and compare it with others, such as the 1080 Ti?


Sunny says
2017-12-12 at 19:59 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-25089)

Hi Tim,

Looking to build a deep learning PC, and I am pretty new to the hardware side of things. Any comments on the recently released Nvidia Titan V? I am specifically interested in a 2 Titan V GPU setup (with Xeon or Ryzen), but have been reading that this card has SLI/NVLink disabled. Will it still be worthwhile to shell out $6K for a powerful deep learning setup that will be viable for at least a few years?


Tim Dettmers says


2017-12-18 at 18:58 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-25370)

I do not recommend buying Titan Vs; they are too cost-ineffective. I will write a new blog post about this topic in the next few days.


Asaf Oron says


2017-12-23 at 02:03 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-25622)

Thank you very much, Tim, for this post.

I am a bit confused: when I look for a certain card, e.g. a GTX 1060, I find the card from multiple vendors, i.e. Zotac or Asus. Some of them have the NVIDIA logo on the box, some don't. Are these the same cards? How do I choose?

I'm looking for my first GPU for a Windows PC capable of running convnets. I have a budget of around $300. What would you recommend?


Tim Dettmers says


2018-01-15 at 23:09 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-27010)

Yes, they are the same. The vendors buy GPUs from NVIDIA and put their own drivers, fans, etc. on them, but essentially it is still an NVIDIA GPU in any case.


Mathieu says
2017-12-24 at 01:35 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-25679)

Hello, and many thanks for your article, which is very useful and relevant, especially (for me) the Xeon Phi part and Intel's behavior (BTW, did you have the premium account which gives you the VIP/direct feedback? I have it and they are rather fast and efficient!). A few of the comments are very interesting as well. I'm doing private research in computer vision and AI, and I practice HPC daily.

Do you think you might update your article with the Quadro P6000, which is very relevant when we need to hold all the data on the device? And then the Titan V, which is expensive too, and at the same time not that expensive if the “business” is using FP16 and lots of tensor maths? For example, it might be interesting to compare 1-2 Titan Vs with 3-4 1080 Tis: for deep learning, yes, but also for other algorithms which scale well without NVLink or even peer-to-peer communication.

Best


Tim Dettmers says


2018-01-15 at 23:08 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-27009)

I think these cards are too expensive for most people, so I will not cover them
here. I also do not have enough data on these cards to compare them directly.


Bruce Wang says


2018-01-01 at 08:59 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-26440)

Brilliant analysis and conclusion.

Thanks a million.


Lety says
2018-01-05 at 16:20 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-26605)

Hi Tim

I'm planning to buy a 1070 Ti; any opinion? It's not in your benchmark analysis.
Thanks.


Tim Dettmers says


2018-01-15 at 22:58 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-27005)

Since GPUs can be expensive due to cryptocurrency mining, I would keep my eyes open for a cheap GTX 1070/1070 Ti/1080 and grab the first cheap card that you can find. All these cards have similar cost/performance. If you can find a used GTX 1070/1080 this might be a bit cheaper, and I personally would prefer such a card compared to a brand new GTX 1070 Ti.


Dave says
2018-01-09 at 17:57 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-26756)

Hello Tim, how does the performance of 2-3 Nvidia 1080 Tis installed in a computer compare with one of NVIDIA's new Titan Vs installed in the same computer?


Tim Dettmers says


2018-01-15 at 22:56 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-27004)

For LSTMs the 3x GTX 1080 Ti will be faster; for convolution, the Titan V. Overall I would prefer the GTX 1080 Tis, as it is much easier to run multiple networks at the same time; this does not work well on a single GPU and is slow.
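
Running several networks at the same time needs no framework support at all; the usual pattern is one training process per GPU, masked via the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch (train.py stands in for your own training script):

    # Launch one independent training job per GPU by masking devices.
    import os
    import subprocess

    jobs = []
    for gpu in range(3):  # e.g. three GTX 1080 Tis
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        jobs.append(subprocess.Popen(["python", "train.py"], env=env))
    for job in jobs:
        job.wait()  # each process only ever sees its own GPU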


Jagan Babu says


2018-01-09 at 20:08 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-26760)


Dear Tim,

I am currently working on deep learning model training for image recognition. I want to create my own work environment and am stuck choosing between the MSI X Trio 1080 Ti and the EVGA 1080 Ti FTW3 with ICX technology. I will be running huge data models (>250GB). I am looking for a robust card without overheating issues, and the cost factor does not matter. Please advise.


Tim Dettmers says


2018-01-15 at 22:55 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-27003)

Those cards are very much the same. If you have benchmarks on how their fans perform for cooling, go with the cooler fans; other than that there will be no difference. If you are worried about temperatures you might want to invest in a liquid-cooled GPU. They will be much, much cooler (and also faster!).


Jagan Babu says


2018-02-04 at 17:52 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-27989)


Thank you Tim…:)


jungju says
2018-01-22 at 10:19 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-27277)

Hello, I have one question. Is it possible to do deep learning on a computer with only one PCIe slot available (3.0 x16)? (I've bought a motherboard for my web server which has only one PCIe slot.)


Tim Dettmers says


2018-01-24 at 13:28 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-27394)

Yes, this will work without any problem.


KR says
2018-01-25 at 12:15 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-27453)

Hi Tim,

What are your thoughts on putting 7 GPUs in a single machine, or are 4 GPUs the practical limit? To make it work you would need a motherboard with 7 PCIe slots, water cooling to make the GPUs fit in a single slot each, and two power supplies.

See the link below:


http://rawandrendered.com/Octane-Render-Hepta-GPU-Build
(http://rawandrendered.com/Octane-Render-Hepta-GPU-Build)

Is this a bad idea?

Thanks for the great post.


Tim Dettmers says


2018-01-27 at 11:16 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-27576)

I assume that you want to parallelize across 7 cards. If not, there is no reason to get a 7 GPU computer, as there are many hardware problems and bottlenecks associated with this. The build that you linked is not good for deep learning because you have few PCIe lanes for parallelization and it would be very slow. If you want to get a 7 GPU system, the only way to go currently is with an EPYC CPU and this motherboard: http://b2b.gigabyte.com/Server-Motherboard/AMD-EPYC-7000 (http://b2b.gigabyte.com/Server-Motherboard/AMD-EPYC-7000)

Even then there might be problems, but the motherboard above is the only configuration where you have enough PCIe lanes and avoid issues with PCIe switches and software. With other CPUs/motherboards you would need 2 CPUs to get the required PCIe lanes for good parallelization performance, but you cannot do CUDA-aware MPI parallelization across the different PCIe root complexes that 2-CPU designs use.

In general, I would only advise you to build such a system if you really need parallelization across 7 GPUs. If you can do your work with 4 GPUs I would definitely recommend going with a 4 GPU setup, as there are no software/hardware problems with that; a 4 GPU system is very straightforward.
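
To see why the interconnect matters, consider the gradient-averaging step that such parallelization runs on every iteration. A toy sketch using mpi4py on CPU buffers (a CUDA-aware MPI build can pass GPU buffers directly, which is exactly where lanes and root complexes come in); the file name is just an example, run with: mpirun -np 4 python allreduce_demo.py

    # Each rank averages its local gradient with all other ranks (all-reduce).
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    grad = np.random.randn(1024).astype(np.float32)  # this rank's local gradient
    total = np.empty_like(grad)
    comm.Allreduce(grad, total, op=MPI.SUM)  # sum gradients across all ranks
    avg = total / comm.Get_size()            # same averaged gradient on every rank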


Nikos Tsarmpopoulos says


2018-01-27 at 14:58 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-27585)

The MZ31-AR0 (rev. 1.0) doesn't appear to feature 7 PCIe x16 slots. It features five full-length x16 PCIe ports and two x8 ports.


Bill Ross (http://phobrain.com) says


2018-02-04 at 01:42 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-27959)

Would a GT 1030 be a valid option? I have a 1080 Ti, but it is hardly used at all with some of my models (0-13% volatile GPU utilization for predictions), and if cuDNN is supported, I wonder if this would do for predictions, which run for ~8-10 days:

input shape: 3794
model file size: 93040732
Keras/TensorFlow

E.g. I loaded/ran 83 models in one session on the 1080 Ti before running out of memory (forgot to add keras.clear_session()), running 100M cases through each.

I have a dedicated Python data-prep thread with a queue 2 deep, bringing the process to 150%.
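
For readers who hit the same wall: the clear_session() pattern mentioned above looks roughly like this. A minimal sketch assuming Keras with a TensorFlow backend and saved .h5 models in a hypothetical models/ directory:

    # Run many saved models in one process without exhausting GPU memory.
    import glob
    import numpy as np
    from keras import backend as K
    from keras.models import load_model

    inputs = np.random.randn(100000, 3794).astype("float32")  # input shape from above
    for path in glob.glob("models/*.h5"):
        model = load_model(path)
        preds = model.predict(inputs, batch_size=4096)
        np.save(path + ".preds.npy", preds)
        K.clear_session()  # frees the previous graph and its GPU memory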


Amin says
2018-02-13 at 04:22 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-28486)


Hello, thanks for the excellent information.

I want to know how much VRAM I need.
Software: Ubuntu 16.04, Caffe, MobileNet SSD.
In CPU-only mode, training consumes 6-8 GB of RAM.
But I have a weak system (Core i5-2400 and 10 GB of 1333 MHz RAM) right now, and I want to find a good system build.
Training speed = 75 iterations/hour!
The GTX 690 seems to be a good choice because of its 384 GB/s memory bandwidth, its roughly 3,000 CUDA cores, and its low price.
But I'm not sure about its 4 GB VRAM.

Also, I would be very thankful if you could give me an estimate of training speed with this build:
GTX 690
Core i7-6700K (4-4.2 GHz, 8 MB cache)
16 GB DDR4 2133 dual channel


Tim Dettmers says


2018-02-15 at 16:39 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-28579)

I would get a more modern GPU. For cheap GPUs I would recommend a GTX 1050 Ti 4GB with 16-bit training, a GTX 1060 6GB, or a GTX 1070 (with 8GB). I am not sure where the memory consumption comes from, but you want to make sure that the dataset is not stored on the GPU.

The new CPU and RAM will hardly affect training performance at all (maybe
10-15% faster) and your money is better invested in buying a better GPU.
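
On keeping the dataset off the GPU: the usual pattern is to hold the data in host RAM and copy one batch at a time, so VRAM only needs to fit the model, the activations, and a single batch. A minimal sketch, assuming PyTorch (Caffe's data layers stream batches from host memory in the same spirit):

    # The dataset lives in host RAM; only one batch at a time is on the GPU.
    import torch
    import torch.nn.functional as F

    data = torch.randn(10000, 3, 128, 128)   # ~2 GB, stays in host RAM
    labels = torch.randint(0, 10, (10000,))
    model = torch.nn.Linear(3 * 128 * 128, 10).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for i in range(0, len(data), 256):
        x = data[i:i+256].flatten(1).cuda()  # copy just this batch to the GPU
        y = labels[i:i+256].cuda()
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()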


Amin says
2018-02-16 at 05:27 (http://timdettmers.com/2017/04/09/which-gpu-
for-deep-learning/#comment-28602)

Thanks.
I don't have a GPU yet.
(System = Core i5-2400 and 10 GB of 1333 MHz RAM.)
On this system, training consumes 6-8 GB of RAM.

If I buy a new system that has a GPU, I want to know how much VRAM will be consumed or needed, because I want to decide on a GPU model.
I don't want to change the software or network parameters (except using CUDA!).
Will all of this RAM consumption go to VRAM?

I have decided to buy a Lenovo Y900 RE gaming tower:

GTX 1080
Core i7-6700K
16 GB DDR4 2133


dougr says
2018-02-16 at 19:41 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-28633)

Tim,
You have previously recommended against the Titan Xp due to the price delta… however, with current availability and pricing of 1080 Ti cards, most of the time the price differential between a 1080 Ti and a Titan Xp (direct from Nvidia) is less than $200. Independent of the current limbo status, if you needed a GPU in the near term, what premium would you place on a Titan Xp vs a 1080 Ti? I'm patient enough to wait for any Nvidia announcements at GTC in late March, but lacking something there, I do not think I have the patience to wait for AMD to come around on the software side.

On a related note, there are a few articles out there discussing GDDR6 in the next round of consumer cards and estimating memory bandwidth based on predicted bus widths and GDDR6 speeds. Most seem to assume that whenever Nvidia gets around to releasing their next architecture, the initial top-end card will have 16GB of GDDR6 and memory bandwidth nearing 600GB/s (and well above that if they use a wider bus in the later enthusiast card). Do you think those are reasonable assumptions given the market dynamics for consumer cards (gaming, crypto, ML)?

Always impressed with your thoughts on the state of the market, Tim… great
resource.


Tim Dettmers says


2018-02-20 at 15:27 (http://timdettmers.com/2017/04/09/which-gpu-for-
deep-learning/#comment-28786)

I agree, if you can snatch a cheap Titan Xp it is well worth it. There also seem to be recent announcements that no new GPU will be introduced at the major GPU conferences; the concept will be introduced, but not the GPU itself. NVIDIA's strategy is likely to introduce a gaming GPU so that deep learning folks have to buy the Titan V if they want to do deep learning. If this is really true, then investing in a Titan Xp makes a lot of sense.

I think your predictions for GDDR6 could make sense. GDDR6 is probably also cheaper to produce than HBM2 memory, so I expect that we will see a lot of cards with it, but as mentioned above, it might be that we see no deep learning cards with it. We will see in the coming months.


Ilya Kavalerov (http://www.ilyakavalerov.com) says

2015-05-14 at 14:29 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-96)

How come there are no mentions of Tesla cards? You can buy them used on eBay for much cheaper than the more mainstream gaming cards. Is the only reason that they require more hardware work, since they are either too big for most regular PC boxes or require a fan assembly since they are often sold without fans (for server airflow), or are there some other costs I'm overlooking?

I also ask b/c in the literature I see more mention of GTX cards than Teslas, but I don't see a reason for the preference other than ease of installation and potential cost savings only at small scales.


Yuriy Filonov (http://yuriyfilonov.wordpress.com) says


2015-07-19 at 00:36 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-118)

Hey Tim, perfect article! Thanks a lot for such a thorough review. Do you know whether it's possible to leverage several GPUs' computational power for DNNs by running the cards in CrossFire mode? Won't this give GPU parallelism “for free”, with no need for complex software adjustments?


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-05-14 at 15:10 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-97)


If you look at the price of used Tesla cards and their performance, then they are almost always worse than any GTX GPU. I never mentioned them because it is highly unlikely that you can find a used Tesla card which beats other cards on performance/price. The cheapest Tesla K20 on eBay (6 months of data) went for about $1000 and is equivalent to a GTX 680; the cheapest K10 went for $260, but the 2nd cheapest was above $1000, and the K10 is slower than a GTX 680. A GTX 680 now goes for about $180. Other cards from the GTX 900 series are significantly faster than the Tesla cards and most often also cheaper.

Tesla cards from the Fermi generation are affordable, but I would not recommend them, because you cannot use standard deep learning software with them (and they are very slow because they are so dated). On top of that you have of course the cooling issues etc., so Tesla cards only make sense if you are exceptionally lucky and snatch one for a very low price.


Tim Dettmers (https://timdettmers.wordpress.com) says


2015-07-19 at 07:51 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-119)

AMD CrossFireX, like NVIDIA SLI, is built to exchange framebuffer information across two GPUs. It seems that these interfaces can only be used for that, mainly because they are too slow for ordinary parallel algorithms. This will change with the new NVLink interface NVIDIA is building, which will supersede the PCIe 3.0 interface for GPU computing. However, as of now, you cannot get around using the PCIe 3.0 interface for communication in multi-GPU applications.


Alex says
2017-05-12 at 17:37 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-14945)

With AMD cards you can have as many cards as you have PCIe slots. You don’t need
to CrossFire them. You can just use them in parallel.

I’m really surprised AMD cards aren’t used for deep learning. I stumbled across this
article and was expecting some comparison, but saw everything was NVIDIA. I’m
coming from the Bitcoin world, where AMD is king for number crunching. That’s
why it’s surprising they’re not being used here. It’s like the complete opposite.


Mike Stone says


2017-08-21 at 17:25 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-19620)

Hi Tim,

thanks for sharing, and very good papers.

So will NVLink improve parallel performance much?


Tim Dettmers says


2017-05-15 at 17:47 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-15025)

There is no optimized deep learning software for AMD cards; that is the only reason why AMD cards are not usually used for deep learning. CrossFire and SLI are not used in deep learning computations; it is really only the software side that makes AMD cards unviable for deep learning.


Robin Colclough (http://www.viewpoint-3d.com) says


2017-07-26 at 19:43 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-17764)

AMD cards are used for deep learning, and there is a massive investment taking that forward with Caffe and PyTorch support and the new MIOpen library.

AMD is driving AI / deep learning costs down 30-60% for researchers and developers.

More details here:

https://instinct.radeon.com/en-us/category/products/
(https://instinct.radeon.com/en-us/category/products/)

I am not employed by AMD, and have no financial interest other than seeing more competition in the market.


Robin Colclough (http://www.viewpoint-3d.com) says


2017-06-16 at 16:35 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16116)

Actually, that is no longer the case. Last year AMD made available in-depth support for deep learning algorithms on its latest GPUs, which are considerably less costly than NVidia's: “AMD took the Caffe framework with 55,000 lines of optimized CUDA code and applied their HIP tooling. 99.6% of the 55,000 lines of code was translated automatically. The remaining code took a week to optimize. Once ported, the HIP code performed as well as the original CUDA version.
HIP is 99% compatible with CUDA, and provides a migration path for developers to support an alternative and much less costly GPU platform. This is great for developers who already have a large CUDA code base.
Early this year AMD decided to get even “closer to the metal” by announcing the “Lightning Compiler Initiative.” This HCC compiler now supports the direct generation of the Radeon GPU instruction set (known as GCN ISA) instead of HSAIL.”


With the new Vega GPU, deep learning on AMD will go beyond what NVidia can offer, and with much greater economies. In addition, the integrated Ryzen+Vega APUs will provide powerful low-cost deep learning on sub-$600 laptops.

More details are available here: https://www.extremetech.com/computing/240908-amd-launches-new-radeon-instinct-gpus-tackle-deep-learning-artificial-intelligence
(https://www.extremetech.com/computing/240908-amd-launches-new-radeon-instinct-gpus-tackle-deep-learning-artificial-intelligence)


Tim Dettmers says


2017-06-16 at 20:04 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-16126)

These advances look very promising. I believe by the end of the year AMD GPUs will be a good alternative option to NVIDIA GPUs. However, currently there are still limitations in the breadth of software that you can use, and it is not clear to me how reliable the entire codebase is. I imagine you could stumble into some problems here and there, but the biggest problem is that currently there is only support for Caffe. If it were TensorFlow or PyTorch support then I would readily recommend AMD GPUs over NVIDIA's, as they are more cost-efficient. AMD is definitely on the horizon and might soon be a formidable opponent for NVIDIA!


Mathieu says
2017-12-24 at 01:53 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-25681)

I don't think AMD will compete with Nvidia. There is huge work involved in the drivers and software (and community), and I don't see AMD catching up with Nvidia. It's been like this for decades, almost. Nvidia was providing drivers for FreeBSD, while AMD was just letting open-source communities do the job, which is huge.

AMD might develop something interesting with integrated GPU, CPU, and memory in one chip to address the memory bandwidth and latency problems, but without serious and long-term involvement in software, it might be interesting only on paper and for marketing.


Tim Dettmers says

2018-01-15 at 23:07 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-


learning/#comment-27008)

The market for AI hardware is so big, I think AMD just has to get into the market (just like Intel). I think AMD has a good chance to find wider adoption in 2018.


Tim Dettmers says


2017-09-01 at 16:14 (http://timdettmers.com/2017/04/09/which-gpu-for-deep-
learning/#comment-20459)

It helps quite a bit, but it is too expensive to make sense for any consumer. NVLink can be used to speed up supercomputers, but not all GPU clusters profit from it. Here at Microsoft, they have such advanced algorithms that there is basically no bottleneck on the PCIe bus, and thus NVLink has no major benefit over normal PCIe.


