CPU Based Exascale Supercomputing Without Accelerators


February 23, 2016

Rob Farber

Intel has been pursuing a long-term, multi-faceted set of investments to create the processors and technologies needed to build CPU-based supercomputers that can deliver exascale levels of performance in a cost- and energy-efficient fashion. Slated to become operational in 2016, the Trinity and Cori supercomputers will be powered by the next-generation Intel Xeon Phi processor, code-named Knights Landing (KNL), booting in self-hosted mode as a standard, single-processor node.
These pre-exascale supercomputers will deliver double-digit petascale performance (for example, 10^16 floating-point operations per second, or flop/s) without the use of attached accelerators or coprocessors. They will certainly dispel the myth that the only path to exascale supercomputing is through a heterogeneous node design with bus-attached, independent computational devices.
Aside from delivering leadership-class computational performance, the Trinity and Cori supercomputers will provide the HPC community with concrete data and valuable insights into the productivity and other benefits of a massively parallel SMP (Symmetric Multi-Processing) supercomputer environment based on CPUs, as compared to the current generation of heterogeneous systems such as Tianhe-2 (which utilizes Intel Xeon Phi coprocessors code-named Knights Corner) and ORNL's Titan supercomputer (accelerated by NVIDIA K20X GPUs). The benefits of a CPU-only software environment at this scale will be of particular interest, as it eliminates the complexity and performance bottlenecks of the offload data transfers required to run heterogeneous applications on accelerators.
Additionally, the Trinity and Cori systems will validate the energy, cost, and performance of the self-hosted KNL computational nodes, plus set the stage for the introduction of a variety of other Intel technology investments into the exascale arena, including innovations in memory and storage (e.g. MCDRAM), networking (e.g. Intel Omni-Path Architecture (Intel OPA) and on-chip Intel OPA), plus software elements that are part of the Intel Scalable System Framework. In particular, a new non-volatile memory technology co-developed by Intel and Micron, called 3D XPoint, is poised to redefine what is meant by memory and storage in high-performance computer architectures and may profoundly affect the cost, capacity, and performance of future supercomputer designs.

Lessons learned from current petascale leadership-class supercomputers


The scientific, technological, and human benefits provided by the double-digit petascale performance of top-ranked, leadership-class supercomputers like Tianhe-2, Titan, Sequoia, and the K computer at RIKEN (the current four fastest supercomputers in the world) and other TOP500 supercomputers have initiated a global race to build the first exascale supercomputer by the end of this decade. A petascale supercomputer can deliver 10^15 flop/s (floating-point arithmetic operations per second), while an exascale system will provide 10^18 flop/s.
Lessons learned from current leadership-class supercomputers show that the costs and technological requirements to increase performance a further 30x beyond that of the Intel Xeon Phi coprocessor-powered Tianhe-2 supercomputer (the fastest supercomputer in the world as of November 2015), which delivers 33.86 petaflop/s (or 3.386 x 10^16 flop/s), to a true exascale 10^18 flop/s machine are substantial, sobering, but achievable due to investment in new technologies.
Cost alone will require that many stakeholders participate in the funding and creation of an exascale system; early machines are anticipated to cost between $500 million and $1 billion. And that means that any machine built will certainly be required to deliver leadership-class performance on a variety of stakeholder workloads. This only makes sense, as packing a supercomputer with devices that only provide floating-point performance to achieve a 10^18 flop/s benchmark goal will be an expensive and meaningless effort if the rest of the hardware does not provide sufficient memory, network, and storage subsystem performance to run real applications faster than existing supercomputers. In short, the stakeholders are going to want to get their money's worth.
The National Strategic Computing Initiative (NSCI), the Department of Energy's CORAL initiative, and the Office of Science and the National Nuclear Security Administration's longer-term investments in exascale computing under the DesignForward high-performance computing R&D program are providing a viable path for the many stakeholders to reach that exascale goal.
What matters then is: (1) how much the hardware cost of those flop/s can be reduced, (2) what
can be done with those flop/s, (3) how much flexibility is given to the HPC developer to exploit
that performance, and (4) what capabilities are provided to efficiently run important
applications that are not flop/s dependent (such as massive graph algorithms).
Balance ratios, discussed in this article, are the metrics used by the HPC community to cut through the hype and make sense of a machine's overall performance envelope, to determine if it can run a desired application mix efficiently, plus get a general sense of the cost and power requirements needed to procure and run the supercomputer. Examples include cost (dollars per flop/s), power (flop/s per watt), and important subsystem ratios such as memory capacity (bytes per flop/s), memory bandwidth (bytes/s per flop/s), and memory transactions per second (memory op/s per flop/s), along with similar ratios for network and storage capabilities.
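
As a concrete illustration, the sketch below computes a few of these balance ratios for a single hypothetical node. All of the figures in it are assumed placeholder values chosen only to show the arithmetic, not published specifications of any machine discussed here.

```cpp
// Minimal balance-ratio calculator. All node figures are illustrative
// assumptions, not published specifications.
#include <cstdio>

int main() {
    const double peak_flops      = 3.0e12;   // assumed peak: 3 Tflop/s per node
    const double mem_bandwidth   = 400e9;    // assumed memory bandwidth: 400 GB/s
    const double mem_capacity    = 112e9;    // assumed capacity: 96 GB DDR + 16 GB near memory
    const double node_power_watt = 300.0;    // assumed node power draw
    const double node_cost_usd   = 5000.0;   // assumed node price

    std::printf("memory bandwidth (bytes/s per flop/s) : %.3f\n", mem_bandwidth / peak_flops);
    std::printf("memory capacity  (bytes per flop/s)   : %.3f\n", mem_capacity / peak_flops);
    std::printf("power            (flop/s per watt)    : %.3e\n", peak_flops / node_power_watt);
    std::printf("cost             (dollars per Gflop/s): %.2f\n", node_cost_usd / (peak_flops / 1e9));
    return 0;
}
```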

Breaking the "exascale requires accelerators" myth with CPUs


Heterogeneity has recently become a buzzword as it has successfully been utilized to increase the flop/s rate of TOP500-winning supercomputers. For example, Tianhe-2 utilizes three Intel Xeon Phi coprocessors per node, while the ORNL Titan supercomputer utilizes one accelerator per node. Plugging a coprocessor or accelerator into the PCIe bus of a computational node has both cost and power advantages for the current generation of systems based on commodity motherboards. However, the thermal, power, and size limitations dictated by the PCIe bus standard impose artificial boundaries on what can be physically installed on a PCIe card, including the number of processing elements that deliver flop/s and the amount of memory.
The Intel Xeon Phi family of processors was designed to support both heterogeneous computing and native SMP (Symmetric Multi-Processing). This design duality has given Intel the ability to claim the TOP500 performance crown in the current leadership-class heterogeneous supercomputing environment with Tianhe-2, while also giving customers the ability to take the lead in future systems that discard the hardware and bandwidth limitations of PCIe-based devices and heterogeneous computing environments.
The NNSA Trinity and NERSC Cori supercomputers will take the next step and provide concrete pre-exascale (10^16 flop/s) demonstrations that machines based entirely on self-hosted SMP computational nodes can support a wide variety of stakeholder workloads. Cori will support a broad user base as an open science DOE system serving hundreds of applications and thousands of users. The Trinity system will be used for more targeted weapons stockpile-related workloads. Together, the CPU-based Trinity and Cori supercomputers will break the mindset that heterogeneous computing is required for exascale systems.

In a very real sense, Amdahl's law for stakeholder applications will dictate the cost and power consumption of future exascale systems, since serial sections of code are expensive to process from a thermal, space, and cost-of-manufacturing standpoint. Parallel sections of code, in particular the SIMD (Single Instruction Multiple Data) regions, can be processed by much more efficient vector units that can deliver high flop/s per dollar and high flop/s per watt.
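
For readers who want the arithmetic behind that statement, here is a minimal sketch of Amdahl's law; the parallel fractions used are arbitrary example values.

```cpp
// Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the fraction of
// the work that can be parallelized and s is the speedup of that fraction.
#include <cstdio>

static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main() {
    // Even with effectively unlimited parallel hardware (s = 1e6), a 1%
    // serial fraction caps whole-application speedup near 100x, which is why
    // exascale designs must still provide "just enough" serial performance.
    const double fractions[] = {0.90, 0.99, 0.999};
    for (double p : fractions) {
        std::printf("parallel fraction %.3f -> maximum speedup %.0fx\n", p, amdahl(p, 1e6));
    }
    return 0;
}
```
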
Intel is incrementally tuning and refining the sequential processing power of the Intel Xeon Phi processors. Seymour Cray famously quipped, "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" In the exascale era, strong serial processors (i.e. oxen) are prohibitively expensive, which means future supercomputer designs have to provide just enough, and no more, serial processing capability. The Intel Xeon Phi processors used in the Trinity and Cori supercomputers will take the next step towards thermally and cost-efficient exascale processors. The KNL processors used in these pre-exascale supercomputers will deliver significantly more serial processing power than the previous-generation Intel Xeon Phi coprocessors, code-named Knights Corner (KNC), used in Tianhe-2. In contrast, GPUs rely on CPUs to perform any serial processing, thus forcing users to run in a heterogeneous environment.
The phrase "knowledge is power" takes on a new yet critical meaning in the exascale context, as greater parallel processing translates into a higher flop/s per watt ratio and, similarly, a lower cost per flop/s ratio. In other words, it will be the parallel floating-point hardware that will make exaflop/s systems possible from a power and cost perspective. A big unknown is whether the new KNL sequential processing capabilities will provide all that is needed for future exascale workloads. The ability to simply recompile and run the same codes on the high-end Intel Xeon E5 v3 processors (formerly code-named Haswell) in Cori and Trinity gives interested exascale stakeholders the ability to compare performance and identify whether any additional sequential processing features need to be added to future Intel Xeon Phi processors. Similarly, the parallel processing capabilities of the dual per-core vector units on each of the KNL processors can be evaluated to see what changes, if any, will be required to enhance performance for key exascale HPC workloads. Gary Grider (Deputy Division Leader for HPC at LANL) observes, "In general, we seem to be back to trying to figure out how much of our problems are Amdahl Law vs. throughput and vectorization again."

From a price/performance perspective, Trinity and Cori are both expected to deliver double-digit petascale performance similar to Tianhe-2 at roughly 1/3 to 1/5 the cost ($380 million for Tianhe-2, $128 million for Trinity, $70 million for Cori). A more precise ratio, as well as the cost per flop/s ratio for the Intel Xeon Phi processor nodes, can be determined once these machines are operational and the performance numbers are published. Energy consumption is also decreasing (e.g. Tianhe-2 17.6 MW, Trinity projected 15 MW, Cori projected 9 MW).
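
The quoted "1/3 to 1/5 the cost" range follows directly from those contract figures; the short check below simply reproduces the arithmetic.

```cpp
// Reproduce the cost-ratio claim from the reported system prices above.
#include <cstdio>

int main() {
    const double tianhe2_usd = 380e6;  // reported Tianhe-2 cost
    const double trinity_usd = 128e6;  // reported Trinity contract value
    const double cori_usd    = 70e6;   // reported Cori contract value

    std::printf("Tianhe-2 / Trinity: %.1fx\n", tianhe2_usd / trinity_usd); // ~3.0
    std::printf("Tianhe-2 / Cori:    %.1fx\n", tianhe2_usd / cori_usd);    // ~5.4
    return 0;
}
```
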
The Trinity and Cori machines give the HPC community the opportunity to evaluate whether the self-hosted, SMP design of the Intel Xeon Phi processor-powered computational nodes delivers more usable flop/s for key HPC applications. Software will play a key role in the success of the Trinity and Cori supercomputers, which is why the Intel Scalable System Framework includes portable programming standards such as OpenMP 4.0, Cilk Plus, and Intel Threading Building Blocks (Intel TBB). These open standards promise that performant, portable codes can be created to exploit the floating-point performance of both SMP and heterogeneous system architectures, even at the exascale.
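
As a flavor of that directive-based portability, the fragment below uses the OpenMP 4.0 "parallel for simd" construct. It is a minimal sketch only, but the same source compiles unchanged for a standard Xeon node or a self-hosted Xeon Phi node (for example with icc -qopenmp or g++ -fopenmp; without the flag the pragma is simply ignored).

```cpp
// OpenMP 4.0 sketch: one directive asks the compiler to both thread and
// vectorize the loop, mapping it onto the per-core vector units.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.5f);

    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] * b[i] + c[i];   // fused multiply-add candidate
    }

    std::printf("c[0] = %f\n", c[0]);
    return 0;
}
```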

The exascale "too much data" problem


Visualization is a good example of the need to balance exascale floating-point performance against other machine characteristics. Hank Childs, recipient of the Department of Energy's Early Career Award to research visualization with exascale computers and Associate Professor in the Department of Computer and Information Science at the University of Oregon, notes, "Our ability to generate data is increasing faster than our ability to store it."


Without performant memory and network subsystems within an exascale supercomputer, it will be difficult or impossible to do something useful with the data generated by those 10^18 flop/s-capable processors. Jim Jeffers (Engineering Manager & PE, Visualization Engineering) underscored this challenge in his editorial, "CPUs Sometimes Best for Big Data Visualization," when he wrote, "a picture is worth an exabyte."
The freely available, Intel-developed open source Embree, OSPRay, and OpenSWR libraries already demonstrate the power of a CPU-based, homogeneous, SMP-based computing environment for software-defined visualization. In particular, Jeffers highlighted the importance of memory capacity in his example of a single Intel Xeon processor E7 v3 workstation containing 3 TB (trillion bytes) of RAM that was able to render a 12-billion particle, 450 GB cosmology dataset at seven frames per second. When presenting this example during his IDF15 talk, "Software Defined Visualization: Fast, Flexible Solutions For Rendering Big Data," Jeffers commented that it would take more than 75 GPUs, each containing 6 gigabytes of on-GPU memory, to perform the same scientific visualization task.
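
The 75-GPU figure is essentially a memory-capacity argument, as the back-of-the-envelope check below shows (using only the numbers quoted in the example above).

```cpp
// Memory-capacity arithmetic behind the visualization example above.
#include <cstdio>

int main() {
    const double dataset_gb    = 450.0;   // cosmology dataset size
    const double gpu_memory_gb = 6.0;     // per-GPU memory in the comparison
    const double node_ram_gb   = 3000.0;  // 3 TB workstation RAM

    std::printf("GPUs needed just to hold the dataset: %.0f\n",
                dataset_gb / gpu_memory_gb);             // 75
    std::printf("Share of one 3 TB node's RAM used: %.0f%%\n",
                100.0 * dataset_gb / node_ram_gb);       // ~15%
    return 0;
}
```
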
At exascale levels, in-situ visualization, which runs the visualization software on the same hardware that generates the data, will probably be a requirement. Other scientific visualization packages such as VisIt and ParaView, along with the Intel Scalable System Framework visualization projects, have already installed hooks in their codes for in-situ visualization. Paul Navratil, manager of the TACC Scalable Visualization Technologies group, reflects a growing view in the HPC community when he notes that exascale supercomputers will have to be performant visualization machines as well as efficient computational engines. He also notes that it is up to organizations such as TACC to expand the realm of what is possible for domain scientists so they can use capabilities like in-situ visualization.
Jim Ahrens (founder and lead of ParaView at Los Alamos National Laboratory) says, "There is a renaissance in visualization and analysis as we figure out how to perform in-situ tasks automatically." Christopher Sewell (LANL) points out the wide support for the VTK-m joint project, which includes LANL, ORNL, Sandia, Kitware, and the University of Oregon, all of whom are working on exploiting the shared-memory parallelism of Trinity as well as other machines to make in-situ visualization readily available to everyone.
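
To make the in-situ idea concrete, here is a deliberately simplified sketch of the pattern (the hook name is hypothetical and this is not the VTK-m or ParaView Catalyst API): the simulation hands its in-memory state to an analysis/render routine on the same node at each reporting interval, so only small reduced products, rather than the full state, ever need to leave the node.

```cpp
// Conceptual in-situ pattern: analyze/render the data where it is generated.
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical in-situ hook; a real one would render an image or compute
// domain-specific statistics instead of a single mean.
void in_situ_hook(const std::vector<double>& field, int step) {
    const double mean = std::accumulate(field.begin(), field.end(), 0.0) / field.size();
    std::printf("step %d: mean field value %.6f\n", step, mean);
}

int main() {
    std::vector<double> field(1 << 20, 1.0);   // simulation state kept in node memory
    for (int step = 0; step < 20; ++step) {
        for (double& v : field) v *= 1.001;    // stand-in for a simulation timestep
        if (step % 5 == 0) in_situ_hook(field, step);  // no full-state dump to disk
    }
    return 0;
}
```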

Redefining memory and storage balance ratios


New memory technologies such as MCDRAM, or stacked memory, and a new non-volatile memory technology that Intel jointly developed with Micron, called 3D XPoint, are poised to redefine memory and storage balance ratios. MCDRAM will deliver significantly higher bandwidth than conventional DDR4 memory, while 3D XPoint promises "up to 4x system memory capacity at significantly lower cost than DRAM, a hundred times lower latency than today's best performing NAND, and write cycle durability that is 1000x that of NAND."


Redefining what is meant by memory


Succinctly, MCDRAM will be used as high-performance near memory to accelerate computational performance, while fast, large-capacity far memory based on conventional DRAM and Intel NVRAM DIMMs using 3D XPoint technology will greatly increase the amount of memory that can be installed on a computational node.
Together, MCDRAM and Intel DIMMs based on 3D XPoint will redefine a number of key supercomputer memory balance ratios, including: (1) memory bandwidth (with MCDRAM), (2) memory capacity (with Intel DIMMs based on 3D XPoint technology), and (3) cost per gigabyte of memory. (Note: Intel DIMMs based on 3D XPoint technology will require a new memory controller on the processor.)
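
From the programmer's side, one way this near/far split is expected to surface on KNL-class nodes is through explicit high-bandwidth allocations, for example via the open source memkind library's hbwmalloc interface. The sketch below assumes memkind is installed, MCDRAM is exposed in flat mode, and the program is linked with -lmemkind; it is an illustration of the idea, not a statement of how Trinity or Cori will be configured.

```cpp
// Sketch: place the bandwidth-critical working set in near memory (MCDRAM)
// and the capacity-bound data in far memory (DDR today; potentially 3D XPoint
// DIMMs in the future). Assumes the memkind library is available.
#include <hbwmalloc.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t n = 1 << 24;
    bool hot_in_hbm = false;
    double* hot = nullptr;

    if (hbw_check_available() == 0) {                      // 0 means HBM is present
        hot = static_cast<double*>(hbw_malloc(n * sizeof(double)));
        hot_in_hbm = (hot != nullptr);
    }
    if (!hot) {                                            // graceful fallback to DDR
        std::puts("high-bandwidth memory unavailable, using DDR");
        hot = static_cast<double*>(std::malloc(n * sizeof(double)));
    }
    double* cold = static_cast<double*>(std::malloc(n * sizeof(double)));

    for (size_t i = 0; i < n; ++i) { hot[i] = 1.0; cold[i] = 0.0; }
    std::printf("hot[0]=%.1f cold[0]=%.1f\n", hot[0], cold[0]);

    if (hot_in_hbm) hbw_free(hot); else std::free(hot);
    std::free(cold);
    return 0;
}
```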

Redefining what is meant by storage


Similarly, storage devices based on 3D XPoint technology will redefine storage performance. For example, a prototype Intel Optane storage device demonstrated at the Intel Developer Forum 2015 (IDF15) delivered a spectacular 5x to 7x performance increase over Intel's current fastest NAND SSD, according to IOMETER, a respected storage performance-monitoring tool.
As with memory, storage using 3D XPoint technology has the potential to redefine a number of key supercomputer storage balance ratios, including: (1) storage bandwidth, (2) IO operations per second, and (3) cost per terabyte.

Other innovative HPC uses


Innovative uses of 3D XPoint memory technology can make a big difference to exascale supercomputing efforts, both by extending the use of in-core algorithms through greater per-node memory capacity and by accelerating the performance of storage-based out-of-core algorithms. 3D XPoint memory can also potentially be used for burst buffers to decrease system cost and accelerate common use cases.
A burst buffer is a nonvolatile, intermediate layer of high-speed storage that can expedite bulk
data transfers to storage. Succinctly, economics are driving the inclusion of burst buffers in
leadership class supercomputers to fill the bandwidth gap created by the need to quickly
service the IO requests of very large numbers of big-memory computational nodes.
Checkpoint/restart operations are a common burst buffer use case.
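
The checkpoint/restart case boils down to a simple pattern, sketched below with hypothetical mount points: dump the state to the fast burst-buffer tier and return to computing, then let the file drain to the parallel file system off the critical path.

```cpp
// Checkpoint-to-burst-buffer sketch (paths are illustrative assumptions).
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

bool write_checkpoint(const std::string& path, const std::vector<double>& state) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(state.data()),
              static_cast<std::streamsize>(state.size() * sizeof(double)));
    return static_cast<bool>(out);
}

int main() {
    std::vector<double> state(1 << 20, 3.14);                     // application state to protect

    const std::string burst_buffer = "/bb/job1234/ckpt.bin";      // fast NVRAM/SSD tier
    const std::string scratch      = "/scratch/job1234/ckpt.bin"; // parallel file system

    // Prefer the burst buffer; fall back to the parallel file system if needed.
    if (!write_checkpoint(burst_buffer, state)) {
        std::puts("burst buffer unavailable, writing checkpoint to scratch");
        write_checkpoint(scratch, state);
    }
    return 0;
}
```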

Other technologies that may further reduce the cost of an exascale supercomputer

Intel is working on a host of other projects that will further reduce the cost of an exascale supercomputer. Very briefly, publicly disclosed projects include (but are not limited to):
The Intel Omni-Path Architecture, an element of the Intel Scalable System Framework, allows denser switches to be built, which will also help reduce the cost of the internal exascale supercomputer network. In addition, Intel OPA promises a 56% reduction in network latency, a huge improvement that can greatly benefit a wide range of HPC applications.

A forthcoming Intel Xeon Phi processor, code-named Knights Landing, will have ports for Intel's 100 Gb/sec Omni-Path interconnect on the chip package. This eliminates the cost of external interface ports while improving reliability.
A planned second generation of the Intel Xeon Phi processor, code-named Knights Hill, will be manufactured using a 10 nm production process, compared with the 14 nm process of the Knights Landing Intel Xeon Phi processor. The result should be an even denser and more power-efficient Intel Xeon Phi processor than those used in the Trinity and Cori procurements.
Both the Knights Landing and Knights Hill Intel Xeon Phi processors include a host of performance-improving features, such as the hardware scatter/gather capabilities introduced with the AVX2 and AVX-512 instruction sets as well as out-of-order sequential processing (a vectorization sketch follows this list).
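
As an illustration of the access pattern those hardware gather instructions target, the loop below reads through an index array; with a simd directive the compiler is free to emit AVX2/AVX-512 gathers where the hardware supports them. This is a generic sketch, not code from Trinity or Cori.

```cpp
// Indirect (gather-style) access: out[i] depends on table[idx[i]].
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> table(n, 2.0), out(n, 0.0);
    std::vector<int> idx(n);
    for (int i = 0; i < n; ++i) idx[i] = (i * 7) % n;   // scattered index pattern

    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        out[i] = 0.5 * table[idx[i]];   // indexed load: a hardware-gather candidate
    }

    std::printf("out[0] = %f\n", out[0]);
    return 0;
}
```
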
A broad spectrum of new technologies is redefining machine architectures, from processors to memory and network subsystems to storage. The Trinity and Cori supercomputer procurements are poised to take the next step, providing the HPC community with valuable success stories and lessons learned that will be incorporated into the next generation of possibly exascale leadership-class supercomputers. In a very real sense, the self-hosted (or bootable) mode of the Intel Xeon Phi family of processors used in the Trinity and Cori supercomputers will concretely demonstrate that heterogeneous computing environments using GPUs and coprocessors are not an exascale requirement. That said, the dual-use Intel Xeon Phi processor design, unlike GPU accelerators, lets customers decide if they want to build a self-hosted or heterogeneous exascale machine.
Visualization is an excellent use case to consider when trying to understand machine balance and how the exascale "too much data" problem can be addressed. A community-wide effort to support in-situ visualization is in progress so domain scientists can better utilize data from future leadership-class and exascale supercomputers. However, running both the simulation and visualization software on the same computational nodes will stress both the memory and network subsystems, which highlights the importance of balanced machine capabilities such as memory capacity and network capability. To meet this need, Intel will support on-chip Intel Omni-Path technology to increase network bandwidth while decreasing both cost and latency. Similarly, cost-effective 3D XPoint memory, along with high-performance MCDRAM, is poised to redefine what is meant by memory and storage, along with their capacity, performance, and cost.

Rob Farber is a global technology consultant and author with an extensive background in HPC
and a long history of working with national labs and corporations engaged in both HPC and
enterprise computing. He can be reached at info@techenablement.com.
Categories: Compute, HPC


Tags: Cori, Knights Landing, Omni-Path, Top 500, Trinity

4 thoughts on "CPU Based Exascale Supercomputing Without Accelerators"
BlackDove says:

February 23, 2016 at 9:04 am

I'm not sure why anyone in the business would expect that a heterogeneous architecture could be useful for exascale, given the nature of the datasets and software that they'll be using. All the checkpointing would probably be made much more difficult by the heterogeneity.
Besides, HPC-specific CPUs currently dominate the heterogeneous machines in terms of actual performance in HPCG, HPCG/HPL balance, and the balance of bytes/FLOP. Current SPARC XIfx designs are 1:1 HPCG/HPL and have excellent byte/FLOP ratios. SX vectors currently have a 1:1 byte/FLOP ratio as well.
Since exascale is the convergence of HPC and big data, with massive memory amounts and bandwidth required, I think it's pretty reasonable to think that the SPARC-based Flagship2020 computing initiative will produce the first real exascale computer.

K is all SPARC CPUs and is still 4th on the Top500 and 1st on the Graph500. The computational efficiency of those SPARC CPU machines is also greater than 90%, while GPU or current heterogeneous Xeon Phi machines like Tianhe-2 are around 55-65% efficient and perform much better on HPL than HPCG, which is becoming less useful toward exascale.

My personal prediction is this: another SPARC machine using the already developed silicon photonics and 3D memory of some type (HMC has been in use on SPARC XIfx since 2014, long before KNL) will be the first real exascale computer and, like K, it will cost over $1 billion.

PrimeHPC FX100 (2014) is already scalable to over 100 PFLOPS with only 512 racks. K was 4 years before that. A greater than 10x performance increase in that short a time is impressive.

I do find it interesting that the architecture that has demonstrated performance in the form of K, and is currently the most sophisticated HPC architecture in use, isn't even mentioned in an exascale CPU article. It gets almost no coverage.

According to an older article on here, Knights Hill may have fewer cores and more memory and memory bandwidth per core, delete DDR4 DIMMs entirely, and use only 3D memory (HMC-derived, probably), making it look a lot less like KNL and more like SPARCfx. Interesting that they'd go back in a sense.

Integrated interconnects like Tofu2 (which is partially optical but not photonic) are already alleviating a large amount of the bottleneck that interconnects pose. If Tofu3 includes silicon photonics it should be interesting to compare to Omni-Path with silicon photonics on Knights Hill/Skylake Purley.

OranjeeGeneral says:

February 24, 2016 at 5:12 am

Interesting to see that someone bets on SPARC; I definitely wouldn't. SPARC's future with a DB company behind it has always been flaky; sure, there has been commitment, but how long will it last? Especially since the hardware manufacturing game is getting more and more expensive and the ROI will get lower, especially if you manufacture at such a low scale as Oracle does. But I agree, I never bought into the hype of heterogeneous architecture. The Xeon Phi and AMD APU/Fusion approaches look far more reasonable. If you need FLOPs, just add a very wide and fast vector unit next to your CPU, because that's basically what GPUs are.

Barry Bolding says:

February 23, 2016 at 2:21 pm


Just a small note: Trinity and Cori are Cray XC systems, Coral is a Cray Shasta system, and Titan is a Cray XK7 system. The ability to drive future exascale performance is a system-level problem, not just a processor-level problem.

John Barr says:

February 24, 2016 at 8:53 am

Without explicitly saying that heterogeneity is bad, the tone of the article suggests that this is the case. However, for most workloads heterogeneity is good, as you can use different components in an HPC system to best execute code sections requiring massively parallel or serial support. If an exascale system is to be more than a one-trick pony, it must have heterogeneous computing elements in order to support the varying compute requirements of a spectrum of applications.

