It is no secret that artificial intelligence (AI) and machine learning have advanced radically over the last decade, yet somewhere between better algorithms and faster processors lies the increasingly important task of engineering systems for maximum performance, and producing better results.

The problem for now, says Nidhi Chappell, director of machine learning in the Datacenter Group at Intel, is that AI experts spend far too much time preprocessing code and data, iterating on models and parameters, waiting for training to converge, and experimenting with deployment models. Each step along the way is too labor- and/or compute-intensive.

The research and development community, spearheaded by companies such as Nvidia, Microsoft, Baidu, Google, Facebook, Amazon, and Intel, is now taking direct aim at the challenge. Teams are experimenting, developing, and even implementing new chip designs, interconnects, and systems to boldly go where AI, deep learning, and machine learning have not gone before. Over the next few years, these developments could have a major impact, even a revolutionary effect, on an array of fields: automated driving, drug discovery, personalized medicine, intelligent assistants, robotics, big data analytics, computer security, and much more. They could deliver faster and better processing for important tasks related to speech, vision, and contextual searching.

[Figure: The design of the NVIDIA NVLink Hybrid Cube Mesh, which connects eight graphics processing units, each with 15 billion transistors.]

Specialized chips can significantly increase performance for fixed-function workloads, because they include everything needed specifically for the task at hand and nothing more. Yet, the task is not without its challenges. For one thing, there's no clear idea about how to use silicon to accelerate AI. Most chip designs and systems are still in the early stages of research, development, or deployment. For another, there's no single design, approach, or method that works well for every situation or AI-based framework.

One thing that is perfectly clear: AI and machine learning frameworks are advancing rapidly. Says Eric Chung, a researcher at Microsoft Research: "We're seeing an escalating, insatiable demand for this kind of technology."

Beyond the GPU
The quest for faster and better processing in AI is nothing new. In recent years, graphical processing units (GPUs) have become the technology of choice for supporting the neural networks that drive AI, deep learning, and machine learning. The reason is simple, even if the underlying technology is complex: GPUs, which were originally invented to improve graphics processing on computers, execute specific tasks faster than conventional central processing units (CPUs). Yet, a specialized design is not ideal for every application or situation. For instance, a search engine such as Bing or Google has very different requirements than the speech processing used on a smartphone, or the visual processing that takes place in an automated vehicle or in the cloud. To varying degrees, systems must support both training and delivering real-time information and controls.

In the quest to boost performance in these systems, designers and engineers are leaving no idea unexamined. However, all the research revolves around a key goal: specialized AI chips will deliver better performance than either CPUs or GPUs. This will undoubtedly "shift the AI compute [framework] moving forward," Chappell explains.
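One way to see why throughput-oriented hardware dominates these workloads is that neural-network computation reduces largely to dense matrix arithmetic, which parallelizes well. The following sketch is purely illustrative (it is not from the article; the matrix sizes and the use of NumPy as a stand-in for parallel hardware are our own assumptions): it contrasts a one-multiply-at-a-time scalar loop, roughly analogous to a serial CPU core, with a vectorized BLAS call on the same operation.

```python
import time

import numpy as np

# A neural-network layer is essentially y = W @ x for a weight
# matrix W and activation vector x. Compare a scalar Python loop
# (one multiply-accumulate at a time) with a vectorized call that
# hands the whole operation to optimized parallel code.

rng = np.random.default_rng(0)
n = 256
W = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def matvec_scalar(W, x):
    # Serial reference implementation: one element at a time.
    y = np.zeros(len(W))
    for i in range(len(W)):
        s = 0.0
        for j in range(len(x)):
            s += W[i, j] * x[j]
        y[i] = s
    return y

t0 = time.perf_counter()
y_slow = matvec_scalar(W, x)
t_scalar = time.perf_counter() - t0

t0 = time.perf_counter()
y_fast = W @ x  # vectorized; dispatched to optimized BLAS
t_vector = time.perf_counter() - t0

# Both paths compute the same result; only the throughput differs.
assert np.allclose(y_slow, y_fast)
print(f"scalar loop: {t_scalar:.4f}s, vectorized: {t_vector:.6f}s")
```

The same logic scales up: GPUs, TPUs, and FPGAs push this kind of data-parallel arithmetic much further in hardware, which is the advantage the specialized chips discussed here are built around.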
In the real world, these boutique chips would greatly reduce training requirements in neural networks, in some cases from days or weeks to hours or minutes. This has the potential to not only improve performance but also slash costs for companies developing AI, deep learning, and machine learning systems. The result would be faster and better visual recognition in automated vehicles, or the ability to reprocess millions of scans for potentially missed markers in healthcare or pharma.

The focus on boutique chips and better AI computation is leading researchers down several avenues. These include improvements in GPUs as well as work on other technologies such as field programmable gate arrays (FPGAs), Tensor Processing Units (TPUs), and other chip systems and architectures that match specific AI and machine learning requirements. These initiatives, says Bryan Catanzaro, vice president of Applied Deep Learning Research at Nvidia, point in the same general direction: "The objective is to build computation platforms that deliver the performance and energy efficiency needed to build AI with a level of accuracy that isn't possible today."

GPUs, for instance, already deliver superior processor-to-memory bandwidth, and they can be applied to many tasks and workloads in the AI arena, including visual and speech processing. The appeal of GPUs revolves around providing greater floating-point operations per second (FLOPs) using fewer watts of electricity, and the ability to extend the energy advantage by supporting 16-bit floating point numbers, which are more power- and energy-efficient than single-precision (32-bit) or double-precision (64-bit) floating point numbers. What is more, GPUs are quite scalable. The Nvidia Tesla P100 chip, which packs 15 billion transistors into a silicon chip, delivers extremely high throughput on AI workloads associated with deep learning.

However, as Moore's Law reaches physical barriers, the technology must evolve further. For now, "There are a lot of ways to customize processor architectures for deep learning," Catanzaro says. Among these: improving efficiency on deep-learning-specific workloads, and better integration between throughput-oriented GPUs and latency-oriented CPUs. For instance, Nvidia has introduced a specialized server called DGX-1, which uses eight Tesla P100 processors to deliver 170 teraflops of compute for neural network training. The system also uses a fast interconnect between GPUs called NVLink, which the company claims allows up to 12 times faster data sharing than traditional PCIe interconnects. "There is still an opportunity for considerable innovation in this space," he says.

New Models Emerge
Other approaches are also ushering in significant gains. For example, Google's Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) that is specifically designed for AI applications such as speech processing and street-view mapping and navigation. It has been used in Google's datacenters for more than 18 months. A big benefit is that the chip is optimized for reduced computational precision. This translates into fewer transistors per operation and the ability to squeeze more operations per second into the chip, which results in better-optimized performance per watt and an ability to use more sophisticated and powerful machine learning models, while applying the results more quickly.

Another technology aimed at advancing AI and machine learning is Microsoft's Project Catapult, which uses field programmable gate arrays (FPGAs) that underpin the widely used Bing search engine, as well as the Azure cloud. This allows teams to implement algorithms directly onto hardware, rather than potentially less-efficient software. Chung says the FPGAs' performance exceeds that of CPUs while retaining flexibility and allowing production systems to operate at hyperscale. He describes the technology as "programmable silicon."

"To be sure, energy-efficient FPGAs satisfy an important requirement when deploying accelerators at hyperscale in power-constrained datacenters. The system delivers a scalable, uniform pool of resources independent from CPUs. For instance, our cloud allows us to allocate few or many FPGAs as a single hardware service," he explains. This, ultimately, allows Microsoft to scale up models seamlessly to a large number of nodes. The result is extremely high throughput with very low latency.

FPGAs are, in fact, highly flexible chips that achieve higher performance and better energy efficiency with reduced numerical precision. "Each computational operation gets more efficient on the FPGA with the fewer bits you use," Chung explains. The current generation of these Intel chips, known as Stratix V FPGAs, will evolve into more advanced versions, including Arria 10 and Stratix 10, he notes. They will introduce additional speed and efficiencies. "With the technology, we can build custom pipelines that are tailored to specific algorithms and models," Chung says. In fact, Microsoft has reached a point where developers can deploy models rapidly, and without underlying technical expertise about the machine learning framework. The appeal is the high level of flexibility. It can be reprogrammed for different AI models and tasks, Chung notes. In fact, the FPGAs can be reprogrammed on the fly to respond to advances in artificial intelligence or different datacenter requirements. A process that previously could take two years or more now can take place in minutes.

Finally, Intel is introducing Nervana, a technology that aims to deliver "unprecedented compute density and high bandwidth interconnect for seamless model parallelism," Chappell says. The technology will focus primarily on multipliers and local memory, and skip elements such as caches that are required for graphics processing but not for deep