http://www.nextplatform.com/2016/02/23/cpu-b...
Rob Farber
20/05/2016 09:14
In a very real sense, Amdahl's law for stakeholder applications will dictate the cost and power
consumption of future exascale systems, since serial sections of code are expensive to process
from a thermal, space, and manufacturing-cost standpoint. Parallel sections of code, in
particular the SIMD (Single Instruction, Multiple Data) regions, can be processed by much
more efficient vector units that deliver high flop/s per dollar and high flop/s per watt.
Intel is incrementally tuning and refining the sequential processing power of the Intel Xeon
Phi processors. Seymour Cray famously quipped, "If you were plowing a field, which would
you rather use: two strong oxen or 1024 chickens?" In the exascale era, strong serial processors
(i.e., oxen) are prohibitively expensive, which means future supercomputer designs have to
provide just enough serial processing capability and no more. The Intel Xeon Phi processors
used in the Trinity and Cori supercomputers will take the next step toward thermally and cost-
efficient exascale processors. The KNL processors used in these pre-exascale supercomputers
will deliver significantly more serial processing power than the previous-generation Intel Xeon
Phi processors, code-named Knights Corner (KNC), used in Tianhe-2. In contrast,
GPUs rely on CPUs to perform any serial processing, thus forcing users to run in a
heterogeneous environment.
From a price/performance perspective, Trinity and Cori are both expected to deliver double-
digit petascale performance similar to Tianhe-2 at roughly one-third to one-fifth the cost
($380 million for Tianhe-2, $128 million for Trinity, $70 million for Cori). A more precise ratio,
as well as the cost per flop/s for the Intel Xeon Phi processor nodes, can be determined once
these machines are operational and their performance numbers are published. Energy
consumption is also decreasing (e.g., Tianhe-2 17.6 MW, Trinity projected 15 MW, Cori
projected 9 MW).
The Trinity and Cori machines give the HPC community the opportunity to evaluate whether
the self-hosted, SMP design of the Intel Xeon Phi processor-powered computational nodes
delivers more usable flop/s for key HPC applications. Software will play a key role in the
success of the Trinity and Cori supercomputers, which is why the Intel Scalable System
Framework includes portable programming standards such as OpenMP 4.0, Cilk Plus, and
Intel Threading Building Blocks (Intel TBB). These open standards promise that performant,
portable codes can be created to exploit the floating-point performance of both SMP and
heterogeneous system architectures, even at the exascale.
Other technologies that may further reduce the cost of an exascale supercomputer
Intel is working on a host of other projects that will further reduce the cost of an exascale
supercomputer. Very briefly, publicly disclosed projects include (but are not limited to):
- The Intel Omni-Path Architecture, an element of the Intel Scalable System Framework,
allows denser switches to be built, which will also help reduce the cost of the internal
exascale supercomputer network. In addition, Intel OPA promises a 56% reduction in
network latency, a huge improvement that can greatly benefit a wide range of HPC
applications.
- A forthcoming Intel Xeon Phi processor, code-named Knights Landing, will have ports for
Intel's 100 Gb/s Intel Omni-Path interconnect on the chip package. This eliminates the
cost of external interface ports while improving reliability.
- A planned second generation of the Intel Xeon Phi processor, code-named Knights Hill,
will be manufactured on a 10 nm production process, compared with the 14 nm process of
Knights Landing. The result should be an even denser and more power-efficient Intel
Xeon Phi processor than those used in the Trinity and Cori procurements.
- Both the Knights Landing and Knights Hill Intel Xeon Phi processors include a host of
performance-improving features, such as the hardware gather capability introduced with
AVX2, the scatter capability added in AVX-512, and out-of-order execution.
A broad spectrum of new technologies is redefining machine architectures, from processors
to memory and network subsystems to storage. The Trinity and Cori supercomputer
procurements are poised to take the next step, providing the HPC community with valuable
success stories and lessons learned that will be incorporated into the next generation of
possibly exascale leadership-class supercomputers. In a very real sense, the self-hosted (or
bootable) mode of the Intel Xeon Phi family of processors used in the Trinity and Cori
supercomputers will concretely demonstrate that heterogeneous computing environments
using GPUs and coprocessors are not an exascale requirement. That said, the dual-use Intel
Xeon Phi processor design, unlike GPU accelerators, lets customers decide whether they
want to build a self-hosted or heterogeneous exascale machine.
Visualization is an excellent use case to consider when trying to understand machine balance
and how the exascale "too much data" problem can be addressed. A community-wide effort to
support in-situ visualization is in progress so that domain scientists can better utilize data from
future leadership-class and exascale supercomputers. However, running both the simulation
and the visualization software on the same computational nodes will stress both the memory
and network subsystems, which highlights the importance of balanced machine capabilities
such as memory capacity and network capability. To meet this need, Intel will support on-chip
Intel Omni-Path technology to increase network bandwidth while decreasing both cost and
latency. Similarly, cost-effective 3D XPoint memory technology, along with high-performance
MCDRAM, is poised to redefine what is meant by memory and storage in terms of capacity,
performance, and cost.
Rob Farber is a global technology consultant and author with an extensive background in HPC
and a long history of working with national labs and corporations engaged in both HPC and
enterprise computing. He can be reached at info@techenablement.com.
I'm not sure why anyone in the business would expect that a heterogeneous architecture could
be useful for exascale, given the nature of the datasets and software that they'll be using. All
the checkpointing would probably be made much more difficult by the heterogeneity.
Besides, HPC-specific CPUs currently dominate the heterogeneous machines in terms of actual
performance in HPCG, HPCG/HPL balance, and bytes/FLOP balance. Current SPARC
XIfx designs are 1:1 HPCG/HPL and have excellent byte/FLOP ratios. SX vectors currently have
a 1:1 byte/FLOP ratio as well.
Since exascale is the convergence of HPC and big data, with massive memory capacity and
bandwidth required, I think it's pretty reasonable to think that the SPARC-based Flagship2020
computing initiative will produce the first real exascale computer.
K is all SPARC CPUs and is still 4th on the Top500 and 1st on the Graph500. The computational
efficiency of those SPARC CPU machines is also greater than 90%, while GPU or current
heterogeneous Xeon Phi machines like Tianhe-2 are around 55-65% efficient and perform much
better on HPL than HPCG, which is becoming less useful toward exascale.
My personal prediction is this: another SPARC machine using the already-developed silicon
photonics and 3D memory of some type (HMC has been in use on SPARC XIfx since 2014, long
before KNL) will be the first real exascale computer, and like K, it will cost over $1 billion.
PrimeHPC FX100 (2014) is already scalable to over 100 PFLOPS with only 512 racks. K was 4
years before that. A greater than 10x performance increase in that short a time is impressive.
I do find it interesting that the architecture that has demonstrated performance in the form of
K, and is currently the most sophisticated HPC architecture in use, isn't even mentioned in an
exascale CPU article. It gets almost no coverage.
According to an older article on here, Knights Hill may have fewer cores and more memory
and memory bandwidth per core, delete DDR4 DIMMs entirely, and use only 3D memory
(probably HMC-derived), making it look a lot less like KNL and more like SPARCfx.
Interesting that they'd go back, in a sense.
Integrated interconnects like Tofu2 (which is partially optical but not photonic) are already
alleviating a large amount of the bottleneck that interconnects pose. If Tofu3 includes silicon
photonics, it should be interesting to compare to Omni-Path with silicon photonics on Knights
Hill/Skylake Purley.
OranjeeGeneral says:
Interesting to see that someone bets on SPARC. I definitely wouldn't. SPARC's future,
with a DB company behind it, has always been flaky; sure, there has been
commitment, but how long will it last? Especially since the hardware manufacturing
game is getting more and more expensive and the ROI will get lower, especially if
you manufacture at such a low scale as Oracle does. But I agree, I never bought into
the hype of heterogeneous architectures. The Xeon Phi and AMD APU/Fusion approaches
look far more reasonable. If you need FLOPs, just add a very wide and fast vector
unit next to your CPU, because that's basically what GPUs are.
Without explicitly saying that heterogeneity is bad, the tone of the article suggests that this is
the case. However, for most workloads heterogeneity is good, as you can use different
components in an HPC system to best execute the code sections requiring massively parallel or
serial support. If an exascale system is to be more than a one-trick pony, it must have
heterogeneous computing elements in order to support the varying compute requirements of
a spectrum of applications.